[PIG-2178] Filtering a source and then merging the filtered rows only generates data from one half of the filtering - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 0.8.1
Fix Version/s: 0.8.1
Component/s: impl
Labels: None

Description

Pig generates a plan that drops half of the input data when the same source is filtered twice with FILTER BY and the filtered rows are then joined.

To illustrate, here is a small test case.
1. Create a file in HDFS called "/testinput".
The contents of the file should be:
"1\ta\taline\n1\tb\tbline"
2. Run the following Pig script:
ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id:chararray, child_id:chararray, value:chararray);
-- Split into two inputs based on the value of child_id
A = FILTER ORIG BY child_id == 'a';
B = FILTER ORIG BY child_id == 'b';
-- Project out the column that was used to select each data set
APROJ = FOREACH A GENERATE parent_id, value;
BPROJ = FOREACH B GENERATE parent_id, value;
-- Merge both data sets by parent_id
ABMERGE = JOIN APROJ BY parent_id FULL OUTER, BPROJ BY parent_id;
-- Project the result
ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value, BPROJ::value;
DUMP ABPROJ;
3. The resulting tuple is
(1,aline,aline)
whereas the expected tuple, with one value from each filtered branch, is
(1,aline,bline)
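
For comparison, the result one would expect from the script can be worked out independently of Pig. The following is a minimal sketch in Python that applies plain relational semantics (filter, project, full outer join) to the same two input rows. It is not Pig's implementation and says nothing about how Pig plans the query. The names `load`, `full_outer_join`, and `expected_abproj` are hypothetical helpers introduced only for this illustration, and the join ignores Pig's handling of null keys, which the test data does not exercise.

```python
from collections import defaultdict

# Contents of /testinput from step 1, as tab-separated rows.
RAW = "1\ta\taline\n1\tb\tbline"


def load(raw):
    """Parse tab-separated lines into (parent_id, child_id, value) tuples."""
    return [tuple(line.split("\t")) for line in raw.split("\n") if line]


def full_outer_join(left, right):
    """Full outer join of (key, value) rows on key.

    Emits (left_key, left_value, right_value), mirroring the final
    projection in the script, where the key is taken from the left side.
    A missing side is filled with None.
    """
    lmap, rmap = defaultdict(list), defaultdict(list)
    for key, value in left:
        lmap[key].append(value)
    for key, value in right:
        rmap[key].append(value)

    out = []
    for key in sorted(set(lmap) | set(rmap)):
        lvals = lmap.get(key)
        rvals = rmap.get(key, [None])
        if lvals is None:
            # No left row for this key: left key and left value are both null.
            out.extend((None, None, rv) for rv in rvals)
        else:
            out.extend((key, lv, rv) for lv in lvals for rv in rvals)
    return out


def expected_abproj(raw):
    """Filter by child_id, project (parent_id, value), then join the branches."""
    orig = load(raw)
    aproj = [(parent, value) for parent, child, value in orig if child == "a"]
    bproj = [(parent, value) for parent, child, value in orig if child == "b"]
    return full_outer_join(aproj, bproj)


print(expected_abproj(RAW))  # [('1', 'aline', 'bline')]
```

Under these assumptions the join yields (1,aline,bline): one value from each filtered branch, rather than the (1,aline,aline) reported above.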

People

Assignee: Unassigned
Reporter: Derek Wollenstein
Votes: 0
Watchers: 1

Dates

Created: 19/Jul/11 04:57
Updated: 21/Jul/11 05:23
Resolved: 21/Jul/11 05:20