Pig / PIG-2178

Filtering a source and then merging the filtered rows only generates data from one half of the filtering

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.8.1
    • Fix Version/s: 0.8.1
    • Component/s: impl
    • Labels:
      None

      Description

      Pig generates a plan that eliminates half of the input data when the same source is filtered twice with FILTER BY and the filtered rows are then joined.

      To better illustrate, I created a small test case.
      1. Create a file in HDFS called "/testinput"
      The contents of the file should be:
      "1\ta\taline\n1\tb\tbline"
      2. Run the following pig script:
      ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id:chararray, child_id:chararray, value:chararray);
      -- Split into two inputs based on the value of child_id
      A = FILTER ORIG BY child_id == 'a';
      B = FILTER ORIG BY child_id == 'b';
      -- Project out the column which chooses the correct data set
      APROJ = FOREACH A GENERATE parent_id, value;
      BPROJ = FOREACH B GENERATE parent_id, value;
      -- Merge both datasets by parent id
      ABMERGE = JOIN APROJ BY parent_id FULL OUTER, BPROJ BY parent_id;
      -- Project the result
      ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value, BPROJ::value;
      DUMP ABPROJ;
      3. The resulting tuple will be
      (1,aline,aline)
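
For comparison, here is a minimal Python sketch of what the script's relational steps (filter, project, full outer join, project) would produce on the two-row test file. It is an in-memory model, not Pig: the helper names `filter_project` and `full_outer_join` are hypothetical, and it assumes the reporter expected the second value to come from the `child_id == 'b'` row.

```python
# In-memory model of the Pig script above; helper names are hypothetical.
# Rows mirror the contents of /testinput: (parent_id, child_id, value).
rows = [("1", "a", "aline"), ("1", "b", "bline")]


def filter_project(rows, child):
    """Model FILTER ... BY child_id == child, then GENERATE parent_id, value."""
    return [(parent_id, value)
            for parent_id, child_id, value in rows
            if child_id == child]


def full_outer_join(left, right):
    """Full outer join of (key, value) lists on key, projected to
    (left key, left value, right value). None pads a missing side."""
    keys = sorted({k for k, _ in left} | {k for k, _ in right})
    joined = []
    for key in keys:
        lrows = [r for r in left if r[0] == key] or [(None, None)]
        rrows = [r for r in right if r[0] == key] or [(None, None)]
        joined.extend((lk, lv, rv) for lk, lv in lrows for _, rv in rrows)
    return joined


aproj = filter_project(rows, "a")   # models APROJ
bproj = filter_project(rows, "b")   # models BPROJ
abproj = full_outer_join(aproj, bproj)  # models ABMERGE followed by ABPROJ
print(abproj)
```

Under these assumptions the model yields one tuple whose two values come from different input rows, which is the outcome the reported tuple above departs from.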

            People

            • Assignee:
              Unassigned
              Reporter:
              dwollen Derek Wollenstein
