Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-19889

Wrong results due to PPD of non deterministic functions with CBO

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The following query can give wrong results when CBO is on:

      select * from (
      select part1,randum123
      from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA' and part2 = 'ABC') a
      where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;
      
      The plan of the query is as follows:
      STAGE PLANS:
        Stage: Stage-1
          Map Reduce
            Map Operator Tree:
                TableScan
                  alias: testa
                  Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
                    Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: 'CA' (type: string), rand() (type: double)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                      Limit
                        Number of rows: 20
                        Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                          table:
                              input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                              output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      
        Stage: Stage-0
          Fetch Operator
            limit: 20
            Processor Tree:
              ListSink
      

      The relevant part in the plan is the filter:

                  Filter Operator
                    predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
      

      The predicates randum123 <= 0.5 and s.randum123 > 0.25 were pushed down. And randum123 was resolved to rand(). This is bad because it will result in invocation of rand() two times and rand() UDF is non-deterministic. Both the rand calls can generate values that can satisfy the predicates independently, but not together, whereas the original intention of the query is to give results when rand falls between 0.25 and 0.5.

      A sample result:

      CA	0.9191984370369802
      CA	0.397933021566812
      

      where the condition was not satisfied.

        Attachments

        1. HIVE-19889.1.patch
          13 kB
          Janaki Lahorani
        2. HIVE-19889.2.patch
          29 kB
          Janaki Lahorani

          Issue Links

            Activity

              People

              • Assignee:
                janulatha Janaki Lahorani
                Reporter:
                janulatha Janaki Lahorani
              • Votes:
                0 Vote for this issue
                Watchers:
                7 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: