  1. Hive
  2. HIVE-20912

Output data might be duplicated while speculation is enabled



    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.2.1
    • Fix Version/s: None
    • Component/s: Hive, Operators
    • Labels:
    • Environment:

      Hive 1.2.1

      Hadoop 2.7.3

      Tez 0.7.0


      The file merge stage had two tasks, which should have created two files, but three files were created.

      Tracing the logs, we found that two task attempts (one of them speculative) happened to finish within the same second. Although the later attempt received a kill signal from the AM, its rename operation had already completed by then, which caused the data duplication.

      The rename is performed in AbstractFileMergeOperator.closeOp(), where the final path name is determined by the task attempt id rather than the task id. In this case, the final paths ended with '000000_0' and '000000_1' rather than '000000'. IMHO, if the final path name ended with the task id without the task attempt id, one task could generate at most one file, which could solve this issue. However, I don't know the side effects of changing the final path name.

      This issue also affects other operators that rename files, such as JoinOperator and FileSinkOperator.
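
      The naming problem described above can be illustrated with a minimal, self-contained sketch. This is not Hive code: the class FinalPathNaming and its methods are hypothetical, the attempt id '000001_0' for the second task is an assumed value, and the sketch only models which final file names result, not the actual rename semantics of the file system.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hive.
public class FinalPathNaming {

    // Strips a trailing "_<attempt>" suffix, e.g. "000000_1" -> "000000".
    public static String taskIdOf(String attemptId) {
        int idx = attemptId.lastIndexOf('_');
        return idx < 0 ? attemptId : attemptId.substring(0, idx);
    }

    // Final names when every finished attempt renames its output to its attempt id:
    // two attempts of the same task yield two distinct files.
    public static Set<String> finalNamesByAttemptId(List<String> finishedAttempts) {
        return new LinkedHashSet<>(finishedAttempts);
    }

    // Final names when the attempt suffix is dropped (the change proposed above):
    // attempts of the same task collapse onto a single name.
    public static Set<String> finalNamesByTaskId(List<String> finishedAttempts) {
        Set<String> names = new LinkedHashSet<>();
        for (String attempt : finishedAttempts) {
            names.add(taskIdOf(attempt));
        }
        return names;
    }

    public static void main(String[] args) {
        // Task 0 has a regular and a speculative attempt that both finished;
        // "000001_0" is an assumed attempt id for the second task.
        List<String> finished = Arrays.asList("000000_0", "000000_1", "000001_0");
        System.out.println(finalNamesByAttemptId(finished)); // [000000_0, 000000_1, 000001_0]
        System.out.println(finalNamesByTaskId(finished));    // [000000, 000001]
    }
}
```

      With attempt-based names, the two tasks produce three files, matching the observed duplication. With task-based names, at most one name exists per task, although how a second rename onto an existing name should be handled (fail, skip, or overwrite) is left open here.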


        1. image-2018-11-14-17-48-59-826.png
          91 kB
          Zihao Ye
        2. image-2018-11-14-17-53-13-191.png
          73 kB
          Zihao Ye
        3. image-2018-11-14-17-53-50-171.png
          61 kB
          Zihao Ye
        4. image-2018-11-14-19-28-18-924.png
          91 kB
          Zihao Ye



            • Assignee:
              zihao.ye Zihao Ye
            • Votes:
              0
            • Watchers:
              2


              • Created: