HIVE-8394

HIVE-7803 doesn't handle Pig MultiQuery, can cause data-loss.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.12.0, 0.13.1, 0.14.0
    • Fix Version/s: 0.14.0
    • Component/s: HCatalog
    • Labels: None

    Description

      We've found situations in production where Pig queries that use HCatStorer, dynamic partitioning, and opt.multiquery=true produce partitions in the output table whose corresponding directories contain no data files, even though Pig reports a non-zero number of records written to HDFS. I don't yet have a distilled test case for this.

      Here's the code from FileOutputCommitterContainer after HIVE-7803:

      FileOutputCommitterContainer.java
        @Override
        public void commitTask(TaskAttemptContext context) throws IOException {
          String jobInfoStr = context.getConfiguration().get(FileRecordWriterContainer.DYN_JOBINFO);
          if (!dynamicPartitioningUsed) {
            // See HCATALOG-499
            FileOutputFormatContainer.setWorkOutputPath(context);
            getBaseOutputCommitter().commitTask(HCatMapRedUtil.createTaskAttemptContext(context));
          } else if (jobInfoStr != null) {
            ArrayList<String> jobInfoList = (ArrayList<String>)HCatUtil.deserialize(jobInfoStr);
            org.apache.hadoop.mapred.TaskAttemptContext currTaskContext = HCatMapRedUtil.createTaskAttemptContext(context);
            for (String jobStr : jobInfoList) {
              OutputJobInfo localJobInfo = (OutputJobInfo)HCatUtil.deserialize(jobStr);
              FileOutputCommitter committer = new FileOutputCommitter(new Path(localJobInfo.getLocation()), currTaskContext);
              committer.commitTask(currTaskContext);
            }
          }
        }
      

      The serialized jobInfoList can't be retrieved (jobInfoStr is null), so with dynamic partitioning neither branch runs and the task's output is never committed. This is because Pig's MapReducePOStoreImpl deliberately clones both the TaskAttemptContext and the contained Configuration instance, so FileOutputCommitterContainer::commitTask() and FileRecordWriterContainer::close() receive separate Configuration instances. Anything set by the RecordWriter is unavailable to the Committer.
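
      To make the failure mode concrete, here is a minimal, Hadoop-free sketch. FakeConf is a hypothetical stand-in for Configuration (a string map with a copying constructor), the key name is a placeholder, and commitBranch() only mirrors the if/else structure of commitTask() quoted above. It is an illustration of the cloning problem, not the HCatalog code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: a mutable string map
// whose copying constructor produces an independent copy.
class FakeConf {
    private final Map<String, String> props;

    FakeConf() { props = new HashMap<>(); }

    FakeConf(FakeConf other) { props = new HashMap<>(other.props); }

    void set(String key, String value) { props.put(key, value); }

    String get(String key) { return props.get(key); }
}

class ClonedConfDemo {
    // Placeholder key; the real value of DYN_JOBINFO is not shown in the report.
    static final String DYN_JOBINFO = "example.dyn.jobinfo";

    // Mirrors the branch structure of commitTask(): returns which branch would run.
    static String commitBranch(FakeConf conf, boolean dynamicPartitioningUsed) {
        String jobInfoStr = conf.get(DYN_JOBINFO);
        if (!dynamicPartitioningUsed) {
            return "static-commit";
        } else if (jobInfoStr != null) {
            return "dynamic-commit";
        }
        return "no-commit"; // falls through silently: nothing is committed
    }

    public static void main(String[] args) {
        FakeConf original = new FakeConf();
        // Writer and committer each receive their own clone.
        FakeConf writerConf = new FakeConf(original);
        FakeConf committerConf = new FakeConf(original);

        // The writer records its job info on close(), but only in its own copy.
        writerConf.set(DYN_JOBINFO, "serialized-job-info");

        System.out.println(commitBranch(committerConf, true)); // prints "no-commit"
        System.out.println(commitBranch(writerConf, true));    // prints "dynamic-commit"
    }
}
```

      The committer's copy never sees the key, so the dynamic-partitioning branch is skipped without any error.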

      One approach would have been to store state in the FileOutputFormatContainer. But that won't work since this is constructed via reflection in HCatOutputFormat (itself constructed via reflection by PigOutputFormat via HCatStorer). There's no guarantee that the instance is preserved.

      My only recourse seems to be to use a Singleton to store shared state. I'm loath to indulge in this brand of shenanigans. (Statics and container-reuse in Tez might not play well together, for instance.) It might work if we're careful about tearing down the singleton.
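
      As a rough sketch of what such a singleton could look like: the class below is hypothetical (SharedCommitState and its method names are invented for illustration and are not taken from any patch on this issue). It keys state by task-attempt ID and removes the entry when the committer reads it, which is one way to be "careful about tearing down" under container reuse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical JVM-wide registry shared by the record writer and the committer.
final class SharedCommitState {
    // Keyed by task-attempt ID so that reused containers don't mix up tasks.
    private static final Map<String, List<String>> LOCATIONS = new ConcurrentHashMap<>();

    private SharedCommitState() {}

    // Called by the writer on close(): remember which output locations need a commit.
    static void register(String taskAttemptId, List<String> locations) {
        LOCATIONS.put(taskAttemptId, Collections.unmodifiableList(new ArrayList<>(locations)));
    }

    // Called by the committer: fetch the locations and tear the entry down in one step.
    static List<String> takeAndClear(String taskAttemptId) {
        List<String> found = LOCATIONS.remove(taskAttemptId);
        return found == null ? Collections.<String>emptyList() : found;
    }

    static int pendingCount() { return LOCATIONS.size(); }
}

class SharedCommitStateDemo {
    public static void main(String[] args) {
        String attempt = "attempt_0001_m_000000_0"; // made-up ID for illustration
        SharedCommitState.register(attempt, Arrays.asList("/out/p=1", "/out/p=2"));

        for (String location : SharedCommitState.takeAndClear(attempt)) {
            System.out.println("commit " + location);
        }
        System.out.println(SharedCommitState.pendingCount()); // prints 0
    }
}
```

      Removing the entry on read keeps stale state from leaking into the next task that runs in the same JVM, but an aborted task would still need its own cleanup path.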

      Any other ideas?

      Attachments

        1. HIVE-8394.1.patch (8 kB, Mithun Radhakrishnan)
        2. HIVE-8394.2.patch (9 kB, Mithun Radhakrishnan)
        3. HIVE-8394.3.patch (10 kB, Mithun Radhakrishnan)
        4. HIVE-8394.4.patch (11 kB, Mithun Radhakrishnan)

            People

              Assignee: Mithun Radhakrishnan
              Reporter: Mithun Radhakrishnan
              Votes: 0
              Watchers: 6
