PIG-3429

Reduce Pig memory footprint using specialized tuple classes (complementary to SchemaTuple)

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.12.0
    • Fix Version/s: None
    • Component/s: data
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Pig's default tuple implementation is very memory-inefficient for small tuples: even an empty tuple takes at least 96 bytes. This leads to bags being spilled more often than they need to be. SchemaTuple addresses this, but it is not fully integrated into the PhysicalPlan pipeline (and integrating it looks difficult). Furthermore, it is likely that almost no UDFs use SchemaTuple.

      This patch therefore provides some basic optimizations to reduce memory footprint of tuples by having BinSedesTupleFactory construct specialized tuple implementations in certain circumstances. This way, anything using BinSedesTupleFactory will reap the benefits, and since SchemaTuple uses a different factory, it will not be interfered with.

      There is a long description below, because this patch might break stuff. I tried to think through possible implementation hazards which I will list.

      The specialized tuple implementations are as follows:

      EmptyTuple // no fields, just an object header = 8 bytes
      NullWrapperTuple // wraps a single null field, 8 bytes
      CountingTuple // replaces (1L) as initial output of COUNT, 8 bytes

      IntegerWrapperTuple // these all wrap a single primitive field
      LongWrapperTuple // object header + rounded primitive size = 16 bytes
      FloatWrapperTuple
      DoubleWrapperTuple

      BinSedesTuple2 // these are pair/triples of fields with no ArrayList
      BinSedesTuple3 // 16/24 bytes of overhead as opposed to 80 from ArrayList

      The memory savings are greatest for the algebraic math functions COUNT, SUM, etc. For example, the size of an intermediate tuple for COUNT should go from 112 bytes to 8 bytes. The size of an intermediate tuple from SUM should go from 112 bytes to 16 bytes.
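
      The selection logic described above can be sketched roughly as follows. This is only an illustration, not code from the patch: MiniTuple, EmptyMiniTuple, LongWrapperMiniTuple, PairMiniTuple, ListMiniTuple and MiniTupleFactory are hypothetical stand-ins for Pig's Tuple, the specialized classes and BinSedesTupleFactory, and the real interfaces have many more methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Pig's Tuple interface, reduced to two methods.
interface MiniTuple {
    int size();
    Object get(int i);
}

// No fields at all, so a single shared instance can serve every empty tuple.
final class EmptyMiniTuple implements MiniTuple {
    static final EmptyMiniTuple INSTANCE = new EmptyMiniTuple();
    private EmptyMiniTuple() {}
    public int size() { return 0; }
    public Object get(int i) { throw new IndexOutOfBoundsException("index " + i); }
}

// Holds one primitive long directly instead of an ArrayList with a boxed Long.
final class LongWrapperMiniTuple implements MiniTuple {
    private final long value;
    LongWrapperMiniTuple(long value) { this.value = value; }
    public int size() { return 1; }
    public Object get(int i) {
        if (i != 0) throw new IndexOutOfBoundsException("index " + i);
        return value; // boxed only when a caller asks for it
    }
}

// Two fields stored as plain references, with no backing ArrayList.
final class PairMiniTuple implements MiniTuple {
    private final Object first;
    private final Object second;
    PairMiniTuple(Object first, Object second) { this.first = first; this.second = second; }
    public int size() { return 2; }
    public Object get(int i) {
        if (i == 0) return first;
        if (i == 1) return second;
        throw new IndexOutOfBoundsException("index " + i);
    }
}

// General-purpose fallback, analogous to a list-backed default tuple.
final class ListMiniTuple implements MiniTuple {
    private final List<Object> fields;
    ListMiniTuple(List<?> fields) { this.fields = new ArrayList<Object>(fields); }
    public int size() { return fields.size(); }
    public Object get(int i) { return fields.get(i); }
}

// Hypothetical factory: picks the smallest implementation that fits the fields.
final class MiniTupleFactory {
    static MiniTuple newTuple(List<?> fields) {
        int n = fields.size();
        if (n == 0) {
            return EmptyMiniTuple.INSTANCE;
        }
        if (n == 1 && fields.get(0) instanceof Long) {
            return new LongWrapperMiniTuple((Long) fields.get(0));
        }
        if (n == 2) {
            return new PairMiniTuple(fields.get(0), fields.get(1));
        }
        return new ListMiniTuple(fields);
    }
}
```

      Callers only ever see the MiniTuple interface, so which concrete class they get back is an implementation detail of the factory. That is the property that would let anything going through the factory benefit without code changes.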

      I haven't finished running the full unit-tests, but TestAlgebraicEval passes so I'm hopeful it will be manageable to debug.

      The three concerns that I have are:
      1) Since TupleFactory now sometimes outputs non-appendable tuples, the isFixedSize() method had to be removed. A file search didn't show it being used anywhere though. I think appending to tuples instead of finding out the requisite size ahead of time is bad practice as well (I changed POForeach to do the latter so it can take advantage of the special tuple impls).
      2) Also since TupleFactory now has multiple tuple types, the tupleClass() method gets tricky. I made a superclass GenericBinSedesTuple that all the specialized classes inherit from, and it seems to work, but I'm not sure what the implications of this are. It breaks the inheritance tree of AbstractTuple <-- DefaultTuple <-- BinSedesTuple, so now "DefaultBinSedesTuple" inherits directly from GenericBinSedesTuple and DefaultTuple is left unused. In the patch, all the stuff for DefaultBinSedesTuple is just copied over from the old DefaultTuple.
      3) I tried to be careful not to break BinInterSedesTupleRawComparator, but this will need verification.

      Finally,
      4) For my personal use cases, I'd like to make custom tuple implementations like SparseMatrixTuple or FeatureVectorTuple. Would people be opposed to making some "hooks" in BinInterSedes for user-defined tuple types? I was thinking there could be some config which maps these hooks (data type bytes) to user-defined classes and uses reflection to instantiate and read them. Not sure if that would be performant though.
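
      To make the "hooks" idea in 4) more concrete, here is a minimal sketch of a registry that maps a data type byte to a user-defined class. Everything in it is an assumption made for illustration: CustomTupleType, FeatureVectorSketch and CustomTupleRegistry are hypothetical names, and nothing like this exists in BinInterSedes today. Resolving and caching the constructor at registration time is one way to keep the per-record reflection cost down to a single newInstance() call, though whether that is fast enough would need measuring.

```java
import java.io.DataInput;
import java.io.IOException;
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

// Hypothetical contract that a user-defined tuple type would implement.
interface CustomTupleType {
    void readFields(DataInput in) throws IOException;
}

// Hypothetical user type: a length-prefixed vector of doubles.
final class FeatureVectorSketch implements CustomTupleType {
    double[] values = new double[0];

    public FeatureVectorSketch() {}

    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = in.readDouble();
        }
    }
}

// Hypothetical registry: type byte -> no-arg constructor of the user class.
final class CustomTupleRegistry {
    private final Map<Byte, Constructor<? extends CustomTupleType>> byTypeByte =
            new HashMap<Byte, Constructor<? extends CustomTupleType>>();

    // In practice the (typeByte, className) pairs would come from configuration.
    void register(byte typeByte, String className) throws ReflectiveOperationException {
        Class<? extends CustomTupleType> cls =
                Class.forName(className).asSubclass(CustomTupleType.class);
        Constructor<? extends CustomTupleType> ctor = cls.getDeclaredConstructor();
        ctor.setAccessible(true);
        byTypeByte.put(typeByte, ctor); // the reflective lookup happens once, here
    }

    // Called by the deserializer after it has read an unrecognized type byte.
    CustomTupleType read(byte typeByte, DataInput in)
            throws IOException, ReflectiveOperationException {
        Constructor<? extends CustomTupleType> ctor = byTypeByte.get(typeByte);
        if (ctor == null) {
            throw new IOException("no class registered for type byte " + typeByte);
        }
        CustomTupleType tuple = ctor.newInstance();
        tuple.readFields(in);
        return tuple;
    }
}
```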

      Thanks for reading all that!

      1. PIG-3429-v2.diff
        74 kB
        Jonathan Packer
      2. PIG-3429-v1.diff
        75 kB
        Jonathan Packer

        Activity

        Jonathan Packer added a comment -

        Hi, so the current patch now seems to pass every unit test except the ones that use the tuple's append() method, which breaks. I have an idea for fixing this, but wanted to wait for feedback to make sure I'm going in the right direction. I know this is changing some important classes, but I think the memory improvements could especially help make Pig local mode more viable for general-purpose use, as memory is more of an issue on laptops than on clusters.

        My idea for fixing append() is to give each specialized tuple impl an extra field, "Tuple promotedTuple". This is null by default, so it only adds 8 bytes of overhead (still much cheaper than the ArrayList when it is unused). If someone needs to append to the specialized tuple, the existing fields are copied into a new default tuple stored in the "promotedTuple" field, and every later call is delegated to it. So there is a small overhead vs. the default tuple when append is used, but in most cases, where append is not used, you retain the memory savings of the specialized tuples. Does this seem like a workable idea?
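
        A rough sketch of the promotion idea, not taken from the patch. PromotableLongTuple is a hypothetical class, and a plain List stands in for the default tuple that the "promotedTuple" field would actually hold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical specialized tuple wrapping a single primitive long.
final class PromotableLongTuple {
    private final long value;

    // Stand-in for "Tuple promotedTuple": null until append() is first called,
    // so an un-appended tuple pays only for one extra reference.
    private List<Object> promoted;

    PromotableLongTuple(long value) {
        this.value = value;
    }

    int size() {
        return promoted == null ? 1 : promoted.size();
    }

    Object get(int i) {
        if (promoted != null) {
            return promoted.get(i); // after promotion, everything is delegated
        }
        if (i != 0) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        return value;
    }

    void append(Object field) {
        if (promoted == null) {
            // First append: copy the existing field into a general-purpose container.
            promoted = new ArrayList<Object>();
            promoted.add(value);
        }
        promoted.add(field);
    }

    boolean isPromoted() {
        return promoted != null;
    }
}
```

        Every accessor has to check the promoted field first, which is the small overhead mentioned above; serialization, compareTo and hashCode would need the same check.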

        Jonathan Packer added a comment -

        Fixed a bunch of unit-test failures due to bugs in the compareTo + hashCode impls for the specialized tuples. Should be almost stable now. Will run the full unit-tests again to be sure.

        Jonathan Packer added a comment -

        Finished running the full unit-tests. As expected, it breaks some stuff. Fortunately, the failures look like small bugs rather than anything fundamentally broken. Will work on fixing these.

        Jonathan Packer added a comment -

        Posted to ReviewBoard: https://reviews.apache.org/r/13630/

        Jonathan Packer added a comment -

        V1 of the patch


          People

          • Assignee:
            Jonathan Packer
            Reporter:
            Jonathan Packer
          • Votes:
            0
            Watchers:
            4

            Dates

            • Created:
              Updated:
