Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
0.2.0
-
None
-
None
-
Reviewed
Description
KeyTypeDiscoveryVisitor visits the map plan to figure out the datatype of the map key. This is required so that when the map key is null, we can still construct a valid NullableXXXWritable object to pass on to hadoop in the collect() call (hadoop needs a valid object even for null objects). Currently the KeyTypeDiscoveryVisitor only looks at POPackage and POLocalRearrange to figure out the key type. In a pig script which results in multiple Map reduce jobs, one of the jobs could have a map plan with only POLoads in it. In such a case, the map key type is not discovered and this results in a null being returned from HDataType.getWritableComparableTypes() method. This in turn will result in a NullPointerException in the collect().
Here is a script which can prompt this behavior:
a = load 'a.txt' as (x:int, y:int, z:int); b = load 'b.txt' as (x:int, y:int); b_group = group b by x; b_sum = foreach b_group generate flatten(group) as x, SUM(b.y) as clicks; a_group = group a by (x, y); a_aggs = foreach a_group { generate flatten(group) as (x, y), SUM(a.z) as zs; }; join_a_b = join b_sum by x, a_aggs by x; --> the map plan for this join will only have two POLoads which will result in the NullPointerException at runtime in collect() dump join_a_b;
Contents of a.txt (columns are tab separated):
The first column of the first two rows is null (represented by an empty column)
7 8 8 9 1 20 30 1 20 40
Contents of b.txt (columns are tab separated):
7 2 1 5 1 10