[HIVE-24033] full outer join returns wrong number of results if hive.optimize.joinreducededuplication is enabled - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
None

Description

We encountered a hive query that returns incorrect results, when joining two CTEs on a group by value. The input tables `id_table` and
`reference_table` are unfortunately too large to share and on smaller tables we have not been able to reproduce.

WITH ids AS (
    SELECT
        record.id AS id
    FROM
        `id_table`
    LATERAL VIEW explode(records) r AS record
    WHERE
        record.id = '5ef0bad74d325f72f0360c19'
    LIMIT 1
),

refs AS (
    SELECT
        reference['id'] AS referenceId
    FROM
        `reference_table`
    WHERE
    	partition_date = '2020-06-24'
        AND type = '1b0e9eb5c492d1859815410253dd79b5'
        AND reference['id'] = '5ef0bad74d325f72f0360c19'
    GROUP BY
        reference['id']
)

SELECT
    l.id AS id
    , r.referenceId AS referenceId
FROM 
    ids l
FULL OUTER JOIN
    refs r
ON
    l.id = r.referenceId

This returns 2 rows, because the join clause misses:

OK
5ef0bad74d325f72f0360c19        NULL
NULL    5ef0bad74d325f72f0360c19

Instead, a single row should be returned. The correct behavior can be achieved by either

calling lower() on the refs group by statement (doesn't change the string contents)
setting hive.optimize.joinreducededuplication=false

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

failing_query_plan_formatted.json
18/Aug/20 10:11
8 kB
Sebastian Klemke
failing_query_plan.txt
12/Aug/20 13:10
3 kB
Sebastian Klemke

Activity

People

Assignee:: Unassigned

Reporter:: Sebastian Klemke

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 12/Aug/20 13:08

Updated:: 18/Aug/20 10:22