HIVE-21915

Hive with TEZ UNION ALL and UDTF results in data loss

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1
    • Fix Version/s: 4.0.0
    • Component/s: Query Processor
    • Labels:
      None

      Description

      The HQL query looks like this:

      CREATE TEMPORARY TABLE tez_union_all_loss_data AS
      SELECT xxx, yyy, zzz, 1 AS tag
      FROM ods_1

      UNION ALL

      SELECT xxx, yyy, zzz, tag
      FROM
      (
        SELECT xxx
              ,get_json_object(get_json_object(tb, '$.a'), '$.b') AS yyy
              ,zzz
              ,2 AS tag
        FROM ods_2
        LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb
      ) tbl
      ;

       

      With the above HQL, we expect rows with both tag = 1 and tag = 2 to appear in the result. In our case, however, all the rows with tag = 1 are lost.
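One way to confirm the loss is to count the rows per tag in the result table and compare the tag = 1 count against the source table. This is only a sketch of such a check. It reuses the table names from the query above, and because the result table is temporary it has to run in the same session that created it.

```sql
-- Rows per branch of the UNION ALL; with the bug, tag = 1 is missing entirely.
SELECT tag, COUNT(*) AS row_count
FROM tez_union_all_loss_data
GROUP BY tag
ORDER BY tag;

-- Expected number of tag = 1 rows: every row of ods_1 should be present.
SELECT COUNT(*) AS expected_tag_1_rows
FROM ods_1;
```

If the first query returns no row for tag = 1 while the second returns a non-zero count, the first branch of the UNION ALL was dropped.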

      Digging deeper, we found that the two generated maps have identical task tmp paths. This happens because, when a UDTF is present, the FileSinkOperator is processed twice while the tmp path is being generated in GenTezUtils.removeUnionOperators().

       

        Attachments

        1. HIVE-21915.01.patch
          1 kB
          Wei Zhang
        2. HIVE-21915.02.patch
          10 kB
          Wei Zhang
        3. HIVE-21915.03.patch
          12 kB
          Wei Zhang
        4. HIVE-21915.04.patch
          10 kB
          Wei Zhang

            People

            • Assignee: Wei Zhang (zhangweilst)
            • Reporter: Wei Zhang (zhangweilst)
