Hive / HIVE-23891

UNION ALL and multiple task attempts can cause file duplication

    Description

      Hello,

      This can happen when all of the following are true:

      • the execution engine is Tez;
      • speculative execution is on;
      • the query inserts into a table and its last step is a UNION clause.

      The problem is that Tez creates an extra layer of subdirectories when there is a UNION. Later, when deduplicating task output, Hive doesn't take that extra level into account: it only deduplicates the folders, not the files inside them.

      So for a query like this:

      insert overwrite table union_all
          select * from union_first_part
      union all
          select * from union_second_part;
      

      Afterwards, the folder structure can look like this (one possible example):

      .../union_all/HIVE_UNION_SUBDIR_1/000000_0
      .../union_all/HIVE_UNION_SUBDIR_1/000000_1
      .../union_all/HIVE_UNION_SUBDIR_2/000000_1
      

      The attached patch increases the number of folder levels that Hive will check recursively for duplicates when we have a UNION in Tez.
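
      To illustrate why the extra directory level matters, here is a minimal Python sketch. It is not Hive's code and not the attached patch. The function `find_duplicates`, its `depth` parameter, and the file-name pattern `<task id>_<attempt number>` are assumptions made for illustration, and keeping the lowest-numbered attempt is an arbitrary choice of this sketch.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed naming convention for task output files: "<task id>_<attempt number>".
TASK_FILE = re.compile(r"^(\d+)_(\d+)$")


def find_duplicates(paths, depth):
    """Return redundant attempt files found `depth` levels below the table dir.

    Hypothetical helper: entries at that level sharing a parent directory and a
    task id count as attempts of one task; all but the lowest attempt are
    reported as duplicates.
    """
    attempts = defaultdict(list)  # (parent dir, task id) -> [(attempt, path)]
    for path in map(PurePosixPath, paths):
        parts = path.parts
        if len(parts) <= depth:
            continue
        match = TASK_FILE.match(parts[depth])
        if match is None:
            # For example a HIVE_UNION_SUBDIR_N directory: not a task file.
            continue
        parent = "/".join(parts[:depth])
        entry = "/".join(parts[:depth + 1])
        attempts[(parent, match.group(1))].append((int(match.group(2)), entry))

    duplicates = []
    for files in attempts.values():
        ordered = sorted(set(files))
        duplicates.extend(entry for _, entry in ordered[1:])
    return sorted(duplicates)


# The example layout from this report, relative to .../union_all/
files = [
    "HIVE_UNION_SUBDIR_1/000000_0",
    "HIVE_UNION_SUBDIR_1/000000_1",
    "HIVE_UNION_SUBDIR_2/000000_1",
]

# Looking only at the direct children sees two folders and no task files.
print(find_duplicates(files, depth=0))  # []
# Looking one level deeper finds the second attempt of task 000000.
print(find_duplicates(files, depth=1))  # ['HIVE_UNION_SUBDIR_1/000000_1']
```

      In this sketch, HIVE_UNION_SUBDIR_2/000000_1 is not flagged, because it is the only attempt of its task within its own subdirectory.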

      Feel free to reach out if you have any questions.

      Attachments

        1. HIVE-23891.1.patch
          6 kB
          George Pachitariu

            People

              dengzh Zhihua Deng
              george.pachitariu George Pachitariu
                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 5h 10m