  1. HCatalog
  2. HCATALOG-577

HCatContext causes persistence of undesired jobConf parameters

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: None
    • Labels: None

    Description

      I've found a fairly interesting bug while experimenting with an e2e test case.

      Consider the following Pig script:

      a = load 'studenttab10k' using org.apache.hcatalog.pig.HCatLoader();
      b = foreach a generate name;
      c = distinct b;
      d = group c all;
      e = foreach d generate $1 as a;
      store e into 'pig_complex_6' using org.apache.hcatalog.pig.HCatStorer();
      exec;
      f = load 'pig_complex_6' using org.apache.hcatalog.pig.HCatLoader();
      g = foreach f generate flatten(a);
      

      With this script, we end up grouping all the distinct names into a single array<string> in one row.

      Say the expected result is:

      {(bob king),(bob ovid),(bob polk)}

      what we actually get is:

      {(bob king)}

      The interesting part is that the problem appears only when "e" is written out using HCatStorer. If we instead store "e" using PigStorage and then, in a separate Pig job, load it and execute the rest, the result is correct.

      Comparing the jobConfs of the two stores (one using HCatStorer, the other using PigStorage), the important difference is that the HCatStorer case has an extra key, mapreduce.combine.class, with the value "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine". That combiner basically just picks the first entry from the bag in order to perform a "distinct" operation. Pig injected it into the earlier load job run through HCatLoader, because we perform a distinct on "b" to get "c". However, HCat stores JobConfs so that they remain usable across multiple setLocation calls (and to cache things like tokens), so the previous job's JobConf is carried over as well, and the distinct combiner ends up being applied to the HCatStorer output too.
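
      The leak described above can be modelled in a few lines of Java. This is only a sketch under stated assumptions: plain maps stand in for Hadoop JobConf objects, and CACHED_CONF, setLocation and runLoadThenStore are hypothetical names chosen for illustration, not HCatalog's actual code. It shows how a key set on the load job resurfaces in a later store job that never asked for it.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the leak; plain maps stand in for Hadoop JobConf objects. */
class JobConfLeakSketch {
    // Hypothetical stand-in for the process-wide state behind HCatContext.INSTANCE.
    static final Map<String, String> CACHED_CONF = new HashMap<>();

    static final String COMBINER_KEY = "mapreduce.combine.class";
    static final String DISTINCT_COMBINER =
        "org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.DistinctCombiner$Combine";

    /**
     * Models a loader/storer call (assumed behaviour): keys remembered from
     * earlier jobs are merged under the job's own keys, and the job's keys
     * are then remembered for later calls.
     */
    static Map<String, String> setLocation(Map<String, String> jobConf) {
        Map<String, String> effective = new HashMap<>(CACHED_CONF);
        effective.putAll(jobConf);    // the job's own settings win
        CACHED_CONF.putAll(jobConf);  // everything is remembered, wanted or not
        return effective;
    }

    /** Runs a load job that carries the distinct combiner, then a plain store job. */
    static Map<String, String> runLoadThenStore() {
        CACHED_CONF.clear();
        Map<String, String> loadJob = new HashMap<>();
        loadJob.put(COMBINER_KEY, DISTINCT_COMBINER); // added by Pig for "distinct b"
        setLocation(loadJob);

        Map<String, String> storeJob = new HashMap<>(); // the store sets no combiner
        return setLocation(storeJob);
    }

    public static void main(String[] args) {
        // Prints the combiner that leaked into the store job's configuration.
        System.out.println(runLoadThenStore().get(COMBINER_KEY));
    }
}
```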

      This is bad behaviour, and we need to clear out HCatContext.INSTANCE between pig Loader / Storer executions.
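
      A minimal sketch of the proposed remedy, again under assumptions: ContextSketch is a hypothetical enum singleton standing in for HCatContext, and its apply and clear methods are invented for illustration (the real class and the attached patch may look quite different). The point is only that clearing the remembered state between a loader execution and a storer execution stops the combiner key from being carried over.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for HCatContext; the real class's API may differ. */
enum ContextSketch {
    INSTANCE;

    private final Map<String, String> cached = new HashMap<>();

    /** Merge remembered keys under the job's own keys, then remember the job's keys. */
    Map<String, String> apply(Map<String, String> jobConf) {
        Map<String, String> effective = new HashMap<>(cached);
        effective.putAll(jobConf);
        cached.putAll(jobConf);
        return effective;
    }

    /** Drop everything remembered from earlier jobs. */
    void clear() {
        cached.clear();
    }
}

class ClearBetweenJobsSketch {
    static final String COMBINER_KEY = "mapreduce.combine.class";

    /**
     * Runs a load job that carries a combiner, optionally clears the context,
     * and returns the effective configuration of the following store job.
     */
    static Map<String, String> storeConfAfterLoad(boolean clearBetween) {
        ContextSketch.INSTANCE.clear(); // start each demo run from a clean slate

        Map<String, String> loadJob = new HashMap<>();
        loadJob.put(COMBINER_KEY, "DistinctCombiner$Combine"); // shortened class name
        ContextSketch.INSTANCE.apply(loadJob);

        if (clearBetween) {
            ContextSketch.INSTANCE.clear(); // the step this issue asks for
        }
        return ContextSketch.INSTANCE.apply(new HashMap<>());
    }

    public static void main(String[] args) {
        System.out.println("leaks without clear: "
            + storeConfAfterLoad(false).containsKey(COMBINER_KEY));
        System.out.println("leaks with clear: "
            + storeConfAfterLoad(true).containsKey(COMBINER_KEY));
    }
}
```

      Clearing everything is the bluntest option; it also discards anything cached on purpose (such as tokens), so a real fix has to weigh that trade-off.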

      Attachments

        1. HCAT-577.patch
          0.7 kB
          Alan Gates

          People

            Assignee: Sushanth Sowmyan
            Reporter: Sushanth Sowmyan
            Votes: 0
            Watchers: 3
