Hive
  1. Hive
  2. HIVE-33

[Hive]: Add optimizer statistics in Hive

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor, Statistics
    • Labels:

      Description

      Add commands to collect partition and column level statistics in hive.

        Issue Links

          Activity

          Hide
          Mikhail Antonov added a comment -

          The last comment is almost 3 years old, may be it's good to clarify the status of the ticket here?

          Show
          Mikhail Antonov added a comment - The last comment is almost 3 years old, may be it's good to clarify the status of the ticket here?
          Shreepadma Venugopalan made changes -
          Summary [Hive]: Add ability to compute statistics on hive tables [Hive]: Add optimizer statistics in Hive
          Gang Tim Liu made changes -
          Link This issue relates to HIVE-3917 [ HIVE-3917 ]
          Tianxu Wang made changes -
          Summary [Hive]: Add ability to compute statistics on hive table [Hive]: Add ability to compute statistics on hive tables
          Tianxu Wang made changes -
          Summary [Hive]: Add ability to compute statistics on hive tables [Hive]: Add ability to compute statistics on hive table
          Jeff Hammerbacher made changes -
          Link This issue relates to HIVE-3027 [ HIVE-3027 ]
          Carl Steinbach made changes -
          Component/s Statistics [ 12314312 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HIVE-2472 [ HIVE-2472 ]
          Carl Steinbach made changes -
          Labels statistics
          Jeff Hammerbacher made changes -
          Link This issue relates to HIVE-1938 [ HIVE-1938 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HIVE-1940 [ HIVE-1940 ]
          Mark Thomas made changes -
          Assignee Ahmed M Aly [ ahmedaly ]
          Carl Steinbach made changes -
          Issue Type Bug [ 1 ] New Feature [ 2 ]
          Hide
          Ning Zhang added a comment -

          Patches for HIVE-1361 are ready for review. Comments are welcome!

          Show
          Ning Zhang added a comment - Patches for HIVE-1361 are ready for review. Comments are welcome!
          Hide
          Ning Zhang added a comment -

          Ahmed is almost ready to submit the patch for one of the dependent JIRA (HIVE-1361). He also put up a design doc on http://wiki.apache.org/hadoop/Hive/StatsDev. Any comments/suggestions are welcome (please comment on HIVE-1361 for table/partition level stats).

          Show
          Ning Zhang added a comment - Ahmed is almost ready to submit the patch for one of the dependent JIRA ( HIVE-1361 ). He also put up a design doc on http://wiki.apache.org/hadoop/Hive/StatsDev . Any comments/suggestions are welcome (please comment on HIVE-1361 for table/partition level stats).
          Jeff Hammerbacher made changes -
          Link This issue incorporates HIVE-1362 [ HIVE-1362 ]
          Jeff Hammerbacher made changes -
          Link This issue incorporates HIVE-1361 [ HIVE-1361 ]
          Ning Zhang made changes -
          Assignee Ahmed M Aly [ ahmedaly ]
          Ashish Thusoo made changes -
          Component/s Query Processor [ 12312586 ]
          Owen O'Malley made changes -
          Field Original Value New Value
          Issue Type New Feature [ 2 ] Bug [ 1 ]
          Assignee Ashish Thusoo [ athusoo ]
          Component/s contrib/hive [ 12312455 ]
          Project Hadoop Core [ 12310240 ] Hadoop Hive [ 12310843 ]
          Key HADOOP-4488 HIVE-33
          Hide
          Ashish Thusoo added a comment -

          all good points... comments are as follows:

          for 1. yes we can store this relatively easily - will add it.
          for 2. the number of bins is optional and not mandatory. We can store the system default as we do for the other variables in hive-conf.xml
          for 3. I am just planning to store the distinct values - no rounding or not storing them at all. Don't want to overload the semantics of this. Not sure how useful rounding is given that
          for 4. there are a number of other useful stats about strings, clearly prefixes are useful for like 'xyz%' kind of operations. We can perhaps add these later considering that we do not even have the base level stats. We can discuss this more to see what makes sense for like and regex kind of predicates.
          for 5. possible... though if we have sufficient number of bins the utility of this stat decreases. But will evaluate this nonetheless.
          for 6. implementable though computationally prohibitive and it is not very clear as to how much benefit this would give - clearly if most of the columns are weekly correlated (independent) then this is not of much use and many times that is quite true. Again this is more advanced stuff. Probably better in a follow on after the base level stats are working...

          Will also add to this list the avg size per column that you were mentioning yesterday.

          So the new list is:

          Table stats:
          1. # rows
          2. size of partition
          3. Avg size of a row
          4. # blocks
          5. # files

          Columns stats:
          1. # distinct values
          2. # null values
          3. min 3 values
          4. max 3 values
          5. histogram: frequency and height balanced.
          6. avg size of column

          Show
          Ashish Thusoo added a comment - all good points... comments are as follows: for 1. yes we can store this relatively easily - will add it. for 2. the number of bins is optional and not mandatory. We can store the system default as we do for the other variables in hive-conf.xml for 3. I am just planning to store the distinct values - no rounding or not storing them at all. Don't want to overload the semantics of this. Not sure how useful rounding is given that for 4. there are a number of other useful stats about strings, clearly prefixes are useful for like 'xyz%' kind of operations. We can perhaps add these later considering that we do not even have the base level stats. We can discuss this more to see what makes sense for like and regex kind of predicates. for 5. possible... though if we have sufficient number of bins the utility of this stat decreases. But will evaluate this nonetheless. for 6. implementable though computationally prohibitive and it is not very clear as to how much benefit this would give - clearly if most of the columns are weekly correlated (independent) then this is not of much use and many times that is quite true. Again this is more advanced stuff. Probably better in a follow on after the base level stats are working... Will also add to this list the avg size per column that you were mentioning yesterday. So the new list is: Table stats: 1. # rows 2. size of partition 3. Avg size of a row 4. # blocks 5. # files Columns stats: 1. # distinct values 2. # null values 3. min 3 values 4. max 3 values 5. histogram: frequency and height balanced. 6. avg size of column
          Hide
          Prasad Chakka added a comment -

          some comments and questions

          1- For each partition (or table for non-partitioned tables), we should store number of files as well (so we can optimize on number of mappers)

          2- We should make the number of bins optional and use default. We might need some trial and error to figure out the optional number depending on number of distinct values/rowcount.

          3- how do you do distinct values for floats? by rounding them or not store at all?

          4- for string we could store stats for some prefix of the string?

          5- in histograms, we should store number distinct values as well in the bucket.

          6- can we store correlation between two columns? it would help figuring out selectivity more accurately.

          Show
          Prasad Chakka added a comment - some comments and questions 1- For each partition (or table for non-partitioned tables), we should store number of files as well (so we can optimize on number of mappers) 2- We should make the number of bins optional and use default. We might need some trial and error to figure out the optional number depending on number of distinct values/rowcount. 3- how do you do distinct values for floats? by rounding them or not store at all? 4- for string we could store stats for some prefix of the string? 5- in histograms, we should store number distinct values as well in the bucket. 6- can we store correlation between two columns? it would help figuring out selectivity more accurately.
          Hide
          Ashish Thusoo added a comment -

          Type of statistics:
          The following types of statistics can be collected on hive partitions ->

          For each partition of the table:
          1. Number of Rows
          2. Size of the partition
          3. Average size of a row
          4. Number of blocks

          For a column in the partition:
          1. Number of distinct values
          2. Number of null values
          3. minimum 3 values
          4. maximum 3 values
          5. Histogram: Frequency histogram or a height balanced histogram (the former has equi range bins while the later has the same height for all the bins)

          The column level statistics could also be calculated for distributions in an average block

          Language Elements:
          ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS - this computes the partition level statistics
          ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR ALL COLUMNS SIZE n - this computes the column level statistics for all columns with n being the number of bins in the historgram
          ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR COLUMNS SIZE m c1 SIZE n1, c2 SIZE n2, c3 - this computes the column level statistics for columns c1 (using n1 bins for the histogram), c2(using n2 bins) and c3 (using the defaut m bins)

          We can later extend these so that these commands can work on samples and be able to extrapolate the results to the entire data set. For that we could use the ESTIMATE STATISTICS SAMPLE n ROWS or ESTIMATE STATISTICS SAMPLE n%

          e.g.

          ANALYZE TABLE <t> PARTITION(<partitionspec>) ESTIMATE STATISTICS 10%

          More details on the actual implementation to follow...

          Show
          Ashish Thusoo added a comment - Type of statistics: The following types of statistics can be collected on hive partitions -> For each partition of the table: 1. Number of Rows 2. Size of the partition 3. Average size of a row 4. Number of blocks For a column in the partition: 1. Number of distinct values 2. Number of null values 3. minimum 3 values 4. maximum 3 values 5. Histogram: Frequency histogram or a height balanced histogram (the former has equi range bins while the later has the same height for all the bins) The column level statistics could also be calculated for distributions in an average block Language Elements: ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS - this computes the partition level statistics ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR ALL COLUMNS SIZE n - this computes the column level statistics for all columns with n being the number of bins in the historgram ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR COLUMNS SIZE m c1 SIZE n1, c2 SIZE n2, c3 - this computes the column level statistics for columns c1 (using n1 bins for the histogram), c2(using n2 bins) and c3 (using the defaut m bins) We can later extend these so that these commands can work on samples and be able to extrapolate the results to the entire data set. For that we could use the ESTIMATE STATISTICS SAMPLE n ROWS or ESTIMATE STATISTICS SAMPLE n% e.g. ANALYZE TABLE <t> PARTITION(<partitionspec>) ESTIMATE STATISTICS 10% More details on the actual implementation to follow...
          Ashish Thusoo created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Ashish Thusoo
            • Votes:
              1 Vote for this issue
              Watchers:
              23 Start watching this issue

              Dates

              • Created:
                Updated:

                Development