  1. Pig
  2. PIG-1474

Avoid serialization/deserialization costs for PigStorage data - Use custom Tuple

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • 0.8.0
    • None
    • None
    • None

    Description

      Avoid serialization/deserialization (sedes) when possible for data loaded using PigStorage by implementing approach #4 proposed in http://wiki.apache.org/pig/AvoidingSedes.

      The write() and readFields() functions of the tuple returned by TupleFactory are used to serialize data between Map and Reduce. By using a tuple that knows the serialization format of the loader, we avoid sedes at the Map-Reduce boundary and use the load function's serialized format between Map and Reduce.
      To use a new custom tuple for this purpose, a custom TupleFactory that returns tuples of this type has to be specified using the property "pig.data.tuple.factory.name".
      This approach will work only for a set of load functions in the query that share the same serialization format for maps and bags. If this approach proves to be very useful, it will build a case for a more extensible approach.
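
      The idea can be illustrated with a minimal, self-contained sketch. It is not the proposed patch and does not use any Pig classes: RawLineTuple is a hypothetical stand-in for the custom tuple, and its write()/readFields() only mirror the method names mentioned above. It assumes tab-delimited UTF-8 lines (PigStorage's default delimiter) and keeps the loader's bytes untouched across the boundary, parsing fields only when one is requested.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the custom tuple; it does not extend any Pig class.
class RawLineTuple {
    // Assumption: fields are separated by a tab, PigStorage's default delimiter.
    private static final byte DELIM = '\t';

    private byte[] raw = new byte[0]; // the line exactly as the loader read it
    private List<String> fields;      // parsed lazily, only when a field is requested

    RawLineTuple() {}

    RawLineTuple(byte[] line) {
        this.raw = line.clone();
    }

    // Map side: emit the loader's bytes unchanged, preceded by their length.
    void write(DataOutput out) throws IOException {
        out.writeInt(raw.length);
        out.write(raw);
    }

    // Reduce side: read the bytes back without parsing them.
    void readFields(DataInput in) throws IOException {
        raw = new byte[in.readInt()];
        in.readFully(raw);
        fields = null; // drop any previously parsed fields
    }

    // Field access triggers the one and only parse of the raw line.
    String get(int index) {
        if (fields == null) {
            fields = split(raw);
        }
        return fields.get(index);
    }

    int size() {
        if (fields == null) {
            fields = split(raw);
        }
        return fields.size();
    }

    private static List<String> split(byte[] line) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= line.length; i++) {
            if (i == line.length || line[i] == DELIM) {
                out.add(new String(line, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        return out;
    }

    // Simulates the Map-Reduce boundary: serialize, then deserialize into a new tuple.
    static RawLineTuple roundTrip(RawLineTuple tuple) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            tuple.write(new DataOutputStream(buffer));
            RawLineTuple copy = new RawLineTuple();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      In an actual implementation the tuple would have to implement Pig's Tuple interface and be returned by the custom TupleFactory named in "pig.data.tuple.factory.name"; typed fields, maps, and bags would also need handling that this sketch leaves out.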

          People

            Assignee: Thejas Nair (thejas)
            Reporter: Thejas Nair (thejas)
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated: