  1. Pig
  2. PIG-2178

Filtering a source and then merging the filtered rows only generates data from one half of the filtering

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.8.1
    • Fix Version/s: 0.8.1
    • Component/s: impl
    • Labels: None

    Description

      Pig generates a plan that drops half of the input data when the same source is filtered twice with FILTER BY and the filtered rows are then joined.

      To better illustrate, I created a small test case.
      1. Create a file in HDFS called "/testinput"
      The contents of the file should be:
      "1\ta\taline\n1\tb\tbline"
      2. Run the following pig script:
      ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id: chararray, child_id: chararray, value: chararray);
      -- Split into two inputs based on the value of child_id
      A = FILTER ORIG BY child_id == 'a';
      B = FILTER ORIG BY child_id == 'b';
      -- Project out the column which chooses the correct data set
      APROJ = FOREACH A GENERATE parent_id, value;
      BPROJ = FOREACH B GENERATE parent_id, value;
      -- Merge both datasets by parent id
      ABMERGE = JOIN APROJ BY parent_id FULL OUTER, BPROJ BY parent_id;
      -- Project the result
      ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value, BPROJ::value;
      DUMP ABPROJ;
      3. The resulting tuple repeats the A-side value in both columns instead of returning (1,aline,bline):
      (1,aline,aline)
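
      For comparison, here is a minimal reference model of what the script is meant to compute, written in plain Python rather than Pig. It is only a sketch of the intended relational semantics (filter, project, full outer join), not of Pig's planner. The names `full_outer_join` and `expected_abproj` are made up for illustration, and the model assumes join keys are never null.

```python
# Reference model of the script's intended semantics (plain Python, not Pig).
# ROWS mirrors the two tab-separated lines written to /testinput.
ROWS = [("1", "a", "aline"), ("1", "b", "bline")]


def full_outer_join(left, right):
    """Full outer join of (key, value) pairs on key.

    A key present on only one side is padded with None on the other side.
    Assumes keys are non-null and mutually comparable.
    """
    joined = []
    for key in sorted({k for k, _ in left} | {k for k, _ in right}):
        left_values = [v for k, v in left if k == key] or [None]
        right_values = [v for k, v in right if k == key] or [None]
        joined.extend((key, lv, rv) for lv in left_values for rv in right_values)
    return joined


def expected_abproj(rows):
    # A / B: FILTER ORIG BY child_id == 'a' / 'b'
    # APROJ / BPROJ: FOREACH ... GENERATE parent_id, value
    aproj = [(pid, value) for pid, child_id, value in rows if child_id == "a"]
    bproj = [(pid, value) for pid, child_id, value in rows if child_id == "b"]
    # ABMERGE / ABPROJ: full outer join on parent_id, keeping both values
    return full_outer_join(aproj, bproj)


print(expected_abproj(ROWS))  # [('1', 'aline', 'bline')]
```

      Under this model the single output row carries one value from each filter branch, which is the result the test case is checking for.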

          People

            Assignee: Unassigned
            Reporter: Derek Wollenstein (dwollen)