Pig / PIG-1473

Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.8.0
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels: None

    Description

      The cost of serialization/deserialization (sedes) can be very high, and avoiding it will improve performance.

      Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes .

      The load function uses subclasses of Map and DataBag that hold the serialized copy of the data. The load function delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called.
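
      The lazy-deserialization idea above can be sketched roughly as follows. This is an illustration only, not code from Pig or from any patch on this issue: the class LazyMap and its methods getRaw and isDeserialized are hypothetical names, and the parser handles only a simplified "[key#value,key#value]" text form with no escaping, nesting, or typed values.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical lazy map: keeps the bytes read by the loader and parses them on first access. */
class LazyMap extends AbstractMap<String, String> {
    private final byte[] raw;            // serialized copy, e.g. "[k1#v1,k2#v2]"
    private Map<String, String> parsed;  // stays null until a Map method is called

    LazyMap(byte[] raw) {
        this.raw = raw;
    }

    /** The untouched serialized copy; a storer could write it back without any sedes. */
    byte[] getRaw() {
        return raw;
    }

    /** True once the bytes have been parsed. */
    boolean isDeserialized() {
        return parsed != null;
    }

    // AbstractMap implements get, size, containsKey, etc. on top of entrySet(),
    // so this is the single place where deserialization is triggered.
    @Override
    public Set<Map.Entry<String, String>> entrySet() {
        if (parsed == null) {
            parsed = parse(raw);
        }
        return parsed.entrySet();
    }

    // Simplified parser (an assumption for this sketch): no escaping or nested types.
    private static Map<String, String> parse(byte[] raw) {
        String s = new String(raw, StandardCharsets.UTF_8).trim();
        Map<String, String> result = new LinkedHashMap<>();
        if (s.length() <= 2) {
            return result;               // "[]" or empty input
        }
        for (String pair : s.substring(1, s.length() - 1).split(",")) {
            int sep = pair.indexOf('#');
            if (sep < 0) {
                continue;                // skip malformed entries
            }
            result.put(pair.substring(0, sep), pair.substring(sep + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        LazyMap b = new LazyMap("[name#pig,ver#0.8]".getBytes(StandardCharsets.UTF_8));
        System.out.println(b.isDeserialized()); // false: nothing parsed yet
        System.out.println(b.get("name"));      // pig: first access triggers parsing
        System.out.println(b.isDeserialized()); // true
    }
}
```

      In this sketch the map is effectively read-only (AbstractMap rejects put), so the raw bytes never go stale. A column that is only passed through a pipeline would never be parsed at all.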

      Example of a query where this will help:

      l = LOAD 'file1' AS (a : int, b : map [ ]);
      f = FOREACH l GENERATE udf1(a), b;      
      fil = FILTER f BY $0 > 5;
      dump fil; -- Deserialization of column b can be delayed until here using this approach.
      
      

          People

            Assignee: Thejas Nair
            Reporter: Thejas Nair
            Votes: 0
            Watchers: 1
