Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-28080

Propagate statistics from a source table to the materialized CTE

    XMLWordPrintableJSON

Details

    Description

      Hive doesn't fill in the statistics of materialized CTEs, and the size of those TableScans is underestimated. That causes Tez to run with fewer tasks or to fail with OOM because MapJoin could be wrongly applied.

       

      The following example shows Map 1 reading `src` generates 493 rows, but Map 3 reading `cte` is expected to scan only 1 row.

      0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN WITH cte AS (
      . . . . . . . . . . . . . . . . . . . . . . .>   SELECT * FROM src
      . . . . . . . . . . . . . . . . . . . . . . .> )
      . . . . . . . . . . . . . . . . . . . . . . .> SELECT *
      . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
      . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
      . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
      ...
      +----------------------------------------------------+
      |                      Explain                       |
      +----------------------------------------------------+
      | Plan optimized by CBO.                             |
      |                                                    |
      | Vertex dependency in Stage-4                       |
      | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
      | Reducer 4 <- Map 3 (SIMPLE_EDGE)                   |
      |                                                    |
      | Stage-3                                            |
      |   Fetch Operator                                   |
      |     limit:-1                                       |
      |     Stage-4                                        |
      |       Map 2 vectorized                             |
      |       File Output Operator [FS_69]                 |
      |         Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
      |           Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"] |
      |         <-Map 3 [BROADCAST_EDGE] vectorized        |
      |           SHUFFLE [RS_61]                          |
      |             PartitionCols:_col0                    |
      |             Select Operator [SEL_60] (rows=1 width=368) |
      |               Output:["_col0","_col1"]             |
      |               Filter Operator [FIL_59] (rows=1 width=368) |
      |                 predicate:key is not null          |
      |                 TableScan [TS_11] (rows=1 width=368) |
      |                   default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
      |         <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
      |             Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"] |
      |           <-Reducer 4 [BROADCAST_EDGE] vectorized  |
      |             BROADCAST [RS_64]                      |
      |               PartitionCols:_col0                  |
      |               Select Operator [SEL_63]             |
      |                 Output:["_col0","_col1"]           |
      |           <-Select Operator [SEL_66] (rows=1 width=368) |
      |               Output:["_col0","_col1"]             |
      |               Filter Operator [FIL_65] (rows=1 width=368) |
      |                 predicate:key is not null          |
      |                 TableScan [TS_5] (rows=1 width=368) |
      |                   default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
      |         Stage-2                                    |
      |           Dependency Collection{}                  |
      |             Stage-1                                |
      |               Map 1 vectorized                     |
      |               File Output Operator [FS_4]          |
      |                 table:{"name:":"default.cte"}      |
      |                 Select Operator [SEL_3] (rows=493 width=350) |
      |                   Output:["_col0","_col1"]         |
      |                   TableScan [TS_0] (rows=493 width=350) |
      |                     default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
      |         Stage-0                                    |
      |           Move Operator                            |
      |              Please refer to the previous Stage-1  |
      |                                                    |
      +----------------------------------------------------+ 

      Attachments

        Issue Links

          Activity

            People

              okumin Shohei Okumiya
              okumin Shohei Okumiya
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: