Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-24033

full outer join returns wrong number of results if hive.optimize.joinreducededuplication is enabled

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      We encountered a hive query that returns incorrect results, when joining two CTEs on a group by value. The input tables `id_table` and
      `reference_table` are unfortunately too large to share and on smaller tables we have not been able to reproduce.

      WITH ids AS (
          SELECT
              record.id AS id
          FROM
              `id_table`
          LATERAL VIEW explode(records) r AS record
          WHERE
              record.id = '5ef0bad74d325f72f0360c19'
          LIMIT 1
      ),
      
      refs AS (
          SELECT
              reference['id'] AS referenceId
          FROM
              `reference_table`
          WHERE
          	partition_date = '2020-06-24'
              AND type = '1b0e9eb5c492d1859815410253dd79b5'
              AND reference['id'] = '5ef0bad74d325f72f0360c19'
          GROUP BY
              reference['id']
      )
      
      SELECT
          l.id AS id
          , r.referenceId AS referenceId
      FROM 
          ids l
      FULL OUTER JOIN
          refs r
      ON
          l.id = r.referenceId
      

      This returns 2 rows, because the join clause misses:

      OK
      5ef0bad74d325f72f0360c19        NULL
      NULL    5ef0bad74d325f72f0360c19
      

      Instead, a single row should be returned. The correct behavior can be achieved by either

      • calling lower() on the refs group by statement (doesn't change the string contents)
      • setting hive.optimize.joinreducededuplication=false

      Attachments

        1. failing_query_plan_formatted.json
          8 kB
          Sebastian Klemke
        2. failing_query_plan.txt
          3 kB
          Sebastian Klemke

        Activity

          People

            Unassigned Unassigned
            packet Sebastian Klemke
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: