HIVE-15122

Hive: Upcasting types should not obscure stats (min/max/ndv)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: None
    • Labels: None

    Description

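      When a join key is upcast (for example, l_suppkey of type int wrapped in UDFToLong to match a bigint key), the column's statistics (min/max/NDV) are no longer visible to the optimizer. This breaks PK/FK inference and triggers mis-estimation of joins in LLAP.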
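
      The behaviour the title asks for can be sketched as follows. This is a minimal illustration in Python, not Hive's implementation: the ColStats class, the stats_after_cast function and the sample values are all hypothetical. It only shows why a widening integer cast can safely keep min, max and NDV.

```python
from dataclasses import dataclass
from typing import Optional

# Byte widths of integer types; a widening cast moves to an equal or larger width.
INT_WIDTH = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}


@dataclass(frozen=True)
class ColStats:
    """Hypothetical container for per-column statistics."""
    type_name: str
    min_value: int
    max_value: int
    ndv: int  # number of distinct values


def stats_after_cast(stats: ColStats, target_type: str) -> Optional[ColStats]:
    """Return stats for cast(column as target_type), or None if unknown."""
    src = INT_WIDTH.get(stats.type_name)
    dst = INT_WIDTH.get(target_type)
    if src is not None and dst is not None and dst >= src:
        # Widening is injective and order-preserving: every value maps to
        # itself, so min, max and ndv carry over unchanged.
        return ColStats(target_type, stats.min_value, stats.max_value, stats.ndv)
    # Narrowing or non-integer casts may overflow or collapse values.
    return None


# Illustrative values only; they are not taken from the issue.
l_suppkey = ColStats("int", min_value=1, max_value=1000, ndv=1000)
upcast = stats_after_cast(l_suppkey, "bigint")
print(upcast)
```

      Returning None for every other cast keeps the sketch conservative: statistics are dropped only when the cast could actually change the value distribution.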

      Snippet from the bad plan.

      | STAGE PLANS:                                                                                                                                                             |
      |   Stage: Stage-1                                                                                                                                                         |
      |     Tez                                                                                                                                                                  |
      |       DagId: hive_20161031222730_a700058f-78eb-40d6-a67d-43add60a50e2:6                                                                                                  |
      |       Edges:                                                                                                                                                             |
      |         Map 2 <- Map 1 (BROADCAST_EDGE)                                                                                                                                  |
      |         Map 3 <- Map 2 (BROADCAST_EDGE)                                                                                                                                  |
      |         Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE), Map 7 (CUSTOM_SIMPLE_EDGE), Map 8 (BROADCAST_EDGE), Map 9 (BROADCAST_EDGE)                                              |
      |         Reducer 5 <- Reducer 4 (SIMPLE_EDGE)                                                                                                                             |
      |         Reducer 6 <- Reducer 5 (SIMPLE_EDGE)                                                                                                                             |
      |       DagName:                                                                                                                                                           |
      |       Vertices:                                                                                                                                                          |
      |         Map 1                                                                                                                                                            |
      |             Map Operator Tree:                                                                                                                                           |
      |                 TableScan                                                                                                                                                |
      |                   alias: supplier                                                                                                                                        |
      |                   filterExpr: (s_suppkey is not null and s_nationkey is not null) (type: boolean)                                                                        |
      |                   Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE                                                       |
      |                   Filter Operator                                                                                                                                        |
      |                     predicate: (s_suppkey is not null and s_nationkey is not null) (type: boolean)                                                                       |
      |                     Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE                                                     |
      |                     Select Operator                                                                                                                                      |
      |                       expressions: s_suppkey (type: bigint), s_nationkey (type: bigint)                                                                                  |
      |                       outputColumnNames: _col0, _col1                                                                                                                    |
      |                       Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE                                                   |
      |                       Reduce Output Operator                                                                                                                             |
      |                         key expressions: _col0 (type: bigint)                                                                                                            |
      |                         sort order: +                                                                                                                                    |
      |                         Map-reduce partition columns: _col0 (type: bigint)                                                                                               |
      |                         Statistics: Num rows: 10000000 Data size: 160000000 Basic stats: COMPLETE Column stats: COMPLETE                                                 |
      |                         value expressions: _col1 (type: bigint)                                                                                                          |
      |             Execution mode: vectorized, llap                                                                                                                             |
      |             LLAP IO: all inputs                                                                                                                                          |
      |         Map 2                                                                                                                                                            |
      |             Map Operator Tree:                                                                                                                                           |
      |                 TableScan                                                                                                                                                |
      |                   alias: lineitem                                                                                                                                        |
      |                   filterExpr: (l_suppkey is not null and l_orderkey is not null) (type: boolean)                                                                         |
      |                   Statistics: Num rows: 2285121364 Data size: 63983407882 Basic stats: COMPLETE Column stats: PARTIAL                                                    |
      |                   Filter Operator                                                                                                                                        |
      |                     predicate: (l_suppkey is not null and l_orderkey is not null) (type: boolean)                                                                        |
      |                     Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL                                                 |
      |                     Select Operator                                                                                                                                      |
      |                       expressions: l_orderkey (type: bigint), l_suppkey (type: int), l_extendedprice (type: double), l_discount (type: double), l_shipdate (type: date)  |
      |                       outputColumnNames: _col0, _col1, _col2, _col3, _col4                                                                                               |
      |                       Statistics: Num rows: 2285121364 Data size: 127966796384 Basic stats: COMPLETE Column stats: PARTIAL                                               |
      |                       Map Join Operator                                                                                                                                  |
      |                         condition map:                                                                                                                                   |
      |                              Inner Join 0 to 1                                                                                                                           |
      |                         keys:                                                                                                                                            |
      |                           0 _col0 (type: bigint)                                                                                                                         |
      |                           1 UDFToLong(_col1) (type: bigint)                                                                                                              |
      |                         outputColumnNames: _col1, _col2, _col4, _col5, _col6                                                                                             |
      |                         input vertices:                                                                                                                                  |
      |                           0 Map 1                                                                                                                                        |
      |                         Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL                                                  |
      |                         Reduce Output Operator                                                                                                                           |
      |                           key expressions: _col2 (type: bigint)                                                                                                          |
      |                           sort order: +                                                                                                                                  |
      |                           Map-reduce partition columns: _col2 (type: bigint)                                                                                             |
      |                           Statistics: Num rows: 10000000 Data size: 880000000 Basic stats: COMPLETE Column stats: PARTIAL                                                |
      |                           value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)                                        |
      |             Execution mode: vectorized, llap                                                                                                                             |
      |             LLAP IO: all inputs                                                                                                                                          |
      |         Map 3                                                                                                                                                            |
      |             Map Operator Tree:                                                                                                                                           |
      |                 TableScan                                                                                                                                                |
      |                   alias: orders                                                                                                                                          |
      |                   filterExpr: (o_orderkey is not null and o_custkey is not null) (type: boolean)                                                                         |
      |                   Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE                                                       |
      |                   Filter Operator                                                                                                                                        |
      |                     predicate: (o_orderkey is not null and o_custkey is not null) (type: boolean)                                                                        |
      |                     Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE                                                     |
      |                     Select Operator                                                                                                                                      |
      |                       expressions: o_orderkey (type: int), o_custkey (type: bigint)                                                                                      |
      |                       outputColumnNames: _col0, _col1                                                                                                                    |
      |                       Statistics: Num rows: 4318801126 Data size: 51825626753 Basic stats: COMPLETE Column stats: NONE                                                   |
      |                       Map Join Operator                                                                                                                                  |
      |                         condition map:                                                                                                                                   |
      |                              Inner Join 0 to 1                                                                                                                           |
      |                         keys:                                                                                                                                            |
      |                           0 _col2 (type: bigint)                                                                                                                         |
      |                           1 UDFToLong(_col0) (type: bigint)                                                                                                              |
      |                         outputColumnNames: _col1, _col4, _col5, _col6, _col8                                                                                             |
      |                         input vertices:                                                                                                                                  |
      |                           0 Map 2                                                                                                                                        |
      |                         Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE                                                 |
      |                         Reduce Output Operator                                                                                                                           |
      |                           key expressions: _col8 (type: bigint)                                                                                                          |
      |                           sort order: +                                                                                                                                  |
      |                           Map-reduce partition columns: _col8 (type: bigint)                                                                                             |
      |                           Statistics: Num rows: 4750681341 Data size: 57008190663 Basic stats: COMPLETE Column stats: NONE                                               |
      |                           value expressions: _col1 (type: bigint), _col4 (type: double), _col5 (type: double), _col6 (type: date)                                        |
      |             Execution mode: vectorized, llap                                                                                                                             |
      |             LLAP IO: all inputs                                                                                                                                          |
      |         Map 7                                                                                                                                 
      

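      Note the output of Map 2, which is broadcast to Map 3: the map join keyed on UDFToLong(_col1) is estimated at only 10000000 rows (Data size: 880000000), although its lineitem input is estimated at 2285121364 rows.

      This causes a rather large join (120GB) to be categorized as a map-join.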
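
      To give a sense of the size of the mis-estimate, the sketch below applies the textbook equi-join cardinality formula to the row counts printed in the plan. It is not necessarily the formula Hive's optimizer uses, and the key NDV is an assumption (s_suppkey treated as unique, as for a primary key); the function and constant names are hypothetical.

```python
def estimate_join_rows(rows_left: int, ndv_left: int,
                       rows_right: int, ndv_right: int) -> int:
    """Textbook equi-join estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    return rows_left * rows_right // max(ndv_left, ndv_right)


SUPPLIER_ROWS = 10_000_000      # from the plan, Map 1 TableScan
LINEITEM_ROWS = 2_285_121_364   # from the plan, Map 2 TableScan
PLAN_ESTIMATE = 10_000_000      # from the plan, Map Join Operator in Map 2

# Assumption: the join key has one distinct value per supplier row.
ASSUMED_KEY_NDV = SUPPLIER_ROWS

expected_rows = estimate_join_rows(
    SUPPLIER_ROWS, ASSUMED_KEY_NDV, LINEITEM_ROWS, ASSUMED_KEY_NDV
)
print(expected_rows)                   # rows expected with usable key stats
print(expected_rows / PLAN_ESTIMATE)   # factor by which the plan underestimates
```

      Under these assumptions the join output should stay near the lineitem row count, which is far more than the plan's estimate and would normally rule out a broadcast.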

      Attachments

        1. HIVE-15122.patch
          3 kB
          jcamachorodriguez
        2. HIVE-15122.03.patch
          22 kB
          jcamachorodriguez

          People

            Assignee: Jesús Camacho Rodríguez (jcamacho)
            Reporter: Siddharth Seth (sseth)
            Votes: 0
            Watchers: 4
