[HIVE-19889] Wrong results due to PPD of non deterministic functions with CBO - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1.0
Component/s: None
Labels:
None

Hadoop Flags:

Reviewed

Description

The following query can give wrong results when CBO is on:

select * from (
select part1,randum123
from (SELECT *, cast(rand() as double) AS randum123 FROM testA where part1='CA' and part2 = 'ABC') a
where randum123 <= 0.5) s where s.randum123 > 0.25 limit 20;

The plan of the query is as follows:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: testa
            Statistics: Num rows: 2 Data size: 4580 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)
              Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: 'CA' (type: string), rand() (type: double)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                Limit
                  Number of rows: 20
                  Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 2290 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: 20
      Processor Tree:
        ListSink

The relevant part in the plan is the filter:

            Filter Operator
              predicate: ((rand() <= 0.5D) and (rand() > 0.25D)) (type: boolean)

The predicates randum123 <= 0.5 and s.randum123 > 0.25 were pushed down. And randum123 was resolved to rand(). This is bad because it will result in invocation of rand() two times and rand() UDF is non-deterministic. Both the rand calls can generate values that can satisfy the predicates independently, but not together, whereas the original intention of the query is to give results when rand falls between 0.25 and 0.5.

A sample result:

CA	0.9191984370369802
CA	0.397933021566812

where the condition was not satisfied.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HIVE-19889.2.patch
16/Jun/18 00:47
29 kB
Janaki Lahorani
HIVE-19889.1.patch
14/Jun/18 00:12
13 kB
Janaki Lahorani

Issue Links

links to

Activity

People

Assignee:: Janaki Lahorani

Reporter:: Janaki Lahorani

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 14/Jun/18 00:05

Updated:: 31/Jul/18 01:33

Resolved:: 21/Jun/18 20:17