Hive / HIVE-21915

Hive with Tez UNION ALL and UDTF results in data loss

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.1
    • Fix Version/s: 4.0.0-alpha-1
    • Component/s: Query Processor
    • Labels: None

    Description

      The HQL that triggers the problem looks like this:

      CREATE TEMPORARY TABLE tez_union_all_loss_data AS
      SELECT xxx, yyy, zzz, 1 AS tag
      FROM ods_1

      UNION ALL

      SELECT xxx, yyy, zzz, tag
      FROM (
        SELECT xxx,
               get_json_object(get_json_object(tb, '$.a'), '$.b') AS yyy,
               zzz,
               2 AS tag
        FROM ods_2
        LATERAL VIEW EXPLODE(some_udf(uuu)) team_number AS tb
      ) tbl;

      With the HQL above, we expect rows with both tag = 1 and tag = 2 to appear in the result. In our case, however, all the rows with tag = 1 are lost.

      Digging deeper, we found that the two generated maps have identical task tmp paths. This happens because, when a UDTF is present, the FileSinkOperator is processed twice while the tmp path is generated in GenTezUtils.removeUnionOperators().
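
The consequence of identical tmp paths can be illustrated with a small toy model. This is only a sketch, not Hive code: `commit_outputs`, `tags_seen`, the row contents, and the `_tmp/...` path strings are all hypothetical, and the model simply assumes that a later writer to the same path replaces the earlier writer's output.

```python
def commit_outputs(task_outputs):
    """Toy model of tasks committing their tmp output.

    task_outputs is a list of (tmp_path, rows) pairs. Assumption: a task
    that writes to a path already used by another task replaces that
    task's rows.
    """
    committed = {}
    for tmp_path, rows in task_outputs:
        committed[tmp_path] = list(rows)  # same path: earlier rows are overwritten
    # The final table is whatever survived under each distinct path.
    return [row for rows in committed.values() for row in rows]


def tags_seen(rows):
    """Return the sorted set of tag values present in the rows."""
    return sorted({row["tag"] for row in rows})


# Hypothetical rows for the two UNION ALL branches.
branch_1 = [{"xxx": "a", "tag": 1}, {"xxx": "b", "tag": 1}]  # SELECT ... FROM ods_1
branch_2 = [{"xxx": "c", "tag": 2}]                          # LATERAL VIEW branch on ods_2

# Both maps share one (hypothetical) tmp path: the tag = 1 rows disappear.
colliding = commit_outputs([("_tmp/000000_0", branch_1),
                            ("_tmp/000000_0", branch_2)])

# Distinct (hypothetical) tmp paths: both branches survive.
distinct = commit_outputs([("_tmp/1/000000_0", branch_1),
                           ("_tmp/2/000000_0", branch_2)])

print(tags_seen(colliding))  # [2]
print(tags_seen(distinct))   # [1, 2]
```

In this model the branch committed last wins, which matches the reported symptom (only tag = 2 rows remain), but the actual commit order in Tez is not stated in this report.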

      Attachments

        1. HIVE-21915.01.patch (1 kB, Wei Zhang)
        2. HIVE-21915.02.patch (10 kB, Wei Zhang)
        3. HIVE-21915.03.patch (12 kB, Wei Zhang)
        4. HIVE-21915.04.patch (10 kB, Wei Zhang)

            People

              Assignee: Wei Zhang (zhangweilst)
              Reporter: Wei Zhang (zhangweilst)
              Votes: 0
              Watchers: 8