Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
0.11.1
None
None
Description
Verified the problem in 0.11.1. In short, a filter should not be pushed above a nested foreach that itself contains a filter operator. See the following minimal example:
cat data;
(1, {(1000, 'a'), (1001, 'b')})
(2, {(2000, 'a'), (2001, 'b'), (2002, 'c')})

A = load 'data' as (id:int, hits:{(score:int, name:chararray)});
B = foreach A {
    filtered = filter hits by score > 2000;
    generate id, filtered;
};
dump B;
(1,{})
(2,{(2001,'b'),(2002,'c')})

C = filter B by SIZE(filtered) > 0;
dump C;
(1,{})
(2,{(2001,'b'),(2002,'c')})
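The difference between the two evaluation orders can be sketched outside Pig. The following is a toy Python model, not Pig's implementation. It assumes that the pushed-up filter ends up testing the size of the original hits bag rather than the filtered one. That assumption is consistent with the output of dump C above, but the report does not confirm it. The function names are hypothetical.

```python
# Toy model of the two evaluation orders; this is not Pig's actual code.
rows = [
    (1, [(1000, "a"), (1001, "b")]),
    (2, [(2000, "a"), (2001, "b"), (2002, "c")]),
]

def nested_filter(row):
    """Model of relation B: keep only hits with score > 2000."""
    rid, hits = row
    return rid, [h for h in hits if h[0] > 2000]

def intended(rows):
    """Apply the SIZE(filtered) > 0 test after the nested foreach."""
    b = [nested_filter(r) for r in rows]
    return [r for r in b if len(r[1]) > 0]

def pushed_up(rows):
    """Assumed faulty plan: the size test runs on the original bag first."""
    kept = [r for r in rows if len(r[1]) > 0]
    return [nested_filter(r) for r in kept]

print(intended(rows))   # only id 2 survives
print(pushed_up(rows))  # id 1 survives with an empty bag
```

Under this assumption the second ordering reproduces the reported output, in which the row with id 1 is kept with an empty bag.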
The desired result, in which the row with the empty bag is dropped, can be obtained either by passing '-optimizer_off PushUpFilter' when invoking Pig or by using the following convoluted workaround:
C = foreach B generate SIZE(filtered) as size, id, filtered;
D = filter C by size > 0;
E = foreach D generate id, filtered;
dump E;
(2,{(2001,'b'),(2002,'c')})