Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Won't Fix
-
0.8.0
-
None
-
None
Description
Cost of serialization/deserialization (sedes) can be very high and avoiding it will improve performance.
Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes .
The load function uses subclass of Map and DataBag which holds the serialized copy. LoadFunction delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called.
Example of query where this will help -
l = LOAD 'file1' AS (a : int, b : map [ ]); f = FOREACH l GENERATE udf1(a), b; fil = FILTER f BY $0 > 5; dump fil; -- Serialization of column b can be delayed until here using this approach .