1. Pig
  2. PIG-1474

Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple


    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:


      Implement approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes to avoid sedes (serialization/deserialization) wherever possible for data loaded using PigStorage.

      The write() and readFields() functions of the tuple returned by TupleFactory are used to serialize data between Map and Reduce. By using a tuple that knows the serialization format of the loader, we avoid sedes at the Map-Reduce boundary and use the load function's serialized format between Map and Reduce.
      To use a new custom tuple for this purpose, a custom TupleFactory that returns tuples of this type has to be specified using the property "pig.data.tuple.factory.name".
      This approach will work only for a set of load functions in the query that share the same serialization format for maps and bags. If this approach proves to be very useful, it will build a case for a more extensible approach.
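
      The idea can be illustrated with a minimal, self-contained sketch. It is not Pig code: RawLineTuple is a hypothetical class that only mimics the write()/readFields() pair described above, and it does not implement Pig's Tuple interface. It assumes tab-delimited text (PigStorage's default delimiter) and shows a tuple that ships the loader's raw bytes unchanged and parses fields only when one is requested.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a real version would implement
// org.apache.pig.data.Tuple and be returned by a custom TupleFactory.
final class RawLineTuple {
    private static final byte DELIM = '\t'; // assumed field delimiter

    private byte[] raw = new byte[0]; // bytes exactly as the loader read them
    private List<String> fields;      // parsed lazily, null until first get()

    RawLineTuple() {}

    RawLineTuple(byte[] rawLine) {
        this.raw = rawLine;
    }

    // Serialize by copying the loader's bytes; no per-field encoding.
    void write(DataOutput out) throws IOException {
        out.writeInt(raw.length);
        out.write(raw);
    }

    // Deserialize by reading the bytes back; fields stay unparsed.
    void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        raw = new byte[len];
        in.readFully(raw);
        fields = null;
    }

    // Parse on demand, the first time any field is accessed.
    String get(int index) {
        if (fields == null) {
            fields = split(raw);
        }
        return fields.get(index);
    }

    private static List<String> split(byte[] bytes) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= bytes.length; i++) {
            if (i == bytes.length || bytes[i] == DELIM) {
                out.add(new String(bytes, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return out;
    }
}
```

      In this sketch the map side never decodes individual fields, and the reduce side decodes them only if they are actually read. Handling of nested maps and bags, which the description notes must share one serialization format across load functions, is left out.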


        Thejas M Nair created issue -
        Thejas M Nair made changes -
        Field Original Value New Value
        Fix Version/s 0.9.0 [ 12315191 ]
        Fix Version/s 0.8.0 [ 12314562 ]
        Olga Natkovich made changes -
        Fix Version/s 0.9.0 [ 12315191 ]


          • Assignee:
            Thejas M Nair
          • Votes:
            0
          • Watchers:
            0


            • Created: