Pig / PIG-2638

Optimize BinInterSedes treatment of longs

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      During adventures in BinInterSedes, I noticed that Integers are written in an optimized fashion, but longs are not. Given that, in the general case, we have to write type information anyway, we might as well do the same optimization for Longs. That is to say, given that most longs won't have 8 bytes of information in them, why should we waste the space of serializing 8 bytes?

      This patch takes its inspiration from varint encoding per these two sources:
      http://javasourcecode.org/html/open-source/mahout/mahout-0.5/org/apache/mahout/math/Varint.java.html
      https://developers.google.com/protocol-buffers/docs/encoding

      Though, nicely enough, we don't actually have to use varints. Since we HAVE to write a one-byte type header anyway, we might as well use it to record the number of bytes we wrote. I use zig zag encoding so that in the case of negative numbers, we still see the benefit.
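
The zig-zag mapping mentioned above can be sketched as follows. This is the standard protobuf-style transform, not code from the patch: it interleaves negatives and non-negatives (0, -1, 1, -2, ... become 0, 1, 2, 3, ...) so that small-magnitude negative longs also serialize in few bytes.

```java
// Zig-zag mapping: small-magnitude values, positive or negative,
// become small unsigned values that fit in fewer bytes.
public class ZigZag {
    static long encode(long v) { return (v << 1) ^ (v >> 63); }  // arithmetic shift smears the sign bit
    static long decode(long z) { return (z >>> 1) ^ -(z & 1); }  // logical shift, then restore the sign

    public static void main(String[] args) {
        for (long v : new long[] {0, -1, 1, -2, 150, -151}) {
            System.out.println(v + " -> " + encode(v) + " -> " + decode(encode(v)));
        }
    }
}
```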

      This should decrease the amount of serialized long data by a good bit.

      Patch incoming. It passes test-commit in 0.11.

      1. PIG-2638-0.patch
        8 kB
        Jonathan Coveney
      2. PIG-2638-1.patch
        6 kB
        Jonathan Coveney

        Activity

        Jonathan Coveney added a comment -

        Committed, revision 1344830. There are definitely gains in this space, but the big one IMHO is doing intelligent buffering.

        Jonathan Coveney added a comment -

        Ashutosh,

        Given the way that we currently serialize values, there is actually no gain to using varint, because we are, in all cases, writing a byte value that specifies what is being serialized. In fact we can do better than varint... instead of needing a bit flag at the head of every byte, we can just have something like the following:

        INT_1BYTE,
        INT_2BYTE,
        INT_3BYTE,
        INT_4BYTE

        and the same analogues for the long. Given that currently there is no way NOT to write that object identification byte, the gain from varint/varlong doesn't exist, since you can do it more compactly (given what we do) anyway. However, as I work to erase that need (in SchemaTuple, for example), varint/varlong begin to make a lot more sense.
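
The tag scheme described above might look like this. Constant names and values are illustrative only, not Pig's actual BinInterSedes constants: the point is that the mandatory type byte doubles as a length marker, so no per-byte continuation flag is needed.

```java
import java.io.*;

public class TaggedInt {
    // Hypothetical tag values, for illustration only.
    static final byte INT_1BYTE = 1, INT_2BYTE = 2, INT_4BYTE = 4;

    // Write the smallest representation that holds v, prefixed by its tag.
    static void write(DataOutput out, int v) throws IOException {
        if (v == (byte) v)       { out.writeByte(INT_1BYTE); out.writeByte(v); }
        else if (v == (short) v) { out.writeByte(INT_2BYTE); out.writeShort(v); }
        else                     { out.writeByte(INT_4BYTE); out.writeInt(v); }
    }

    // The tag tells us exactly how many bytes to read back.
    static int read(DataInput in) throws IOException {
        switch (in.readByte()) {
            case INT_1BYTE: return in.readByte();
            case INT_2BYTE: return in.readShort();
            default:        return in.readInt();
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        write(dos, 5);        // tag + 1 byte = 2 bytes on the wire
        write(dos, 100000);   // tag + 4 bytes = 5 bytes on the wire
        System.out.println("total bytes: " + bos.size());
    }
}
```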

        I think this patch is some really easy low hanging fruit, and in the future I have some ideas around how to greatly improve serialization performance that will be more sweeping.

        Would love your thoughts.

        Ashutosh Chauhan added a comment -

        Another option is to use VInts and its friends which are used in core hadoop and in other parts of the ecosystem. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/VIntWritable.html

        Bill Graham added a comment -

        +1

        Looks good to me.

        Jonathan Coveney added a comment -

        Bump. This bad boy is incremental but has literally no downside. I'm beginning to think we should rip out the custom intermediate and replace it with Kryo but until such a time, I think this is appropriate.

        Jonathan Coveney added a comment -

        This is a newer version that is better. I basically just used the same method we use for Ints (if the value fits in a byte, write a byte; if it fits in a short, write a short; and so on) instead of the previous method. This does mean that for anything larger than an int the whole long will be written, but meh. It's an improvement, and there is no performance degradation at all, so it's an easy win IMHO.

        Caliper output:

         New
        size    us linear runtime
           5  30.0 =====
          10  55.5 ==========
          15  82.0 ===============
          20 105.4 ====================
          25 135.7 ==========================
          30 156.1 ==============================
        
        Old
           5  30.7 =====
          10  55.5 ==========
          15  79.5 ===============
          20 105.2 ====================
          25 130.4 =========================
          30 156.0 ==============================
        

        The benchmark was simply serializing a tuple of size x and then immediately deserializing it. As you can see, no difference, and the version with the patch will take up less space on disk.
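
The write-side dispatch described in this comment can be sketched as below. The tag names and values are hypothetical, not the actual patch: pick the smallest of byte/short/int that holds the long, otherwise fall back to writing all 8 bytes.

```java
import java.io.*;

public class TaggedLong {
    // Hypothetical tag values, for illustration only.
    static final byte LONG_1BYTE = 1, LONG_2BYTE = 2, LONG_4BYTE = 3, LONG_8BYTE = 4;

    /** Writes v with a 1-byte tag; returns the total number of bytes written. */
    static int write(DataOutput out, long v) throws IOException {
        if (v == (byte) v)  { out.writeByte(LONG_1BYTE); out.writeByte((int) v);  return 2; }
        if (v == (short) v) { out.writeByte(LONG_2BYTE); out.writeShort((int) v); return 3; }
        if (v == (int) v)   { out.writeByte(LONG_4BYTE); out.writeInt((int) v);   return 5; }
        // Larger than an int: write the whole long, as the comment notes.
        out.writeByte(LONG_8BYTE); out.writeLong(v);                              return 9;
    }

    public static void main(String[] args) throws IOException {
        DataOutputStream dos = new DataOutputStream(new ByteArrayOutputStream());
        System.out.println(write(dos, 7L));        // small long: 2 bytes instead of 9
        System.out.println(write(dos, 1L << 40));  // too big for an int: full 9 bytes
    }
}
```

Under this scheme a long always costs the same 1-byte tag it cost before, so the worst case is unchanged while common small values shrink, which matches the flat benchmark numbers above.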


          People

          • Assignee:
            Jonathan Coveney
            Reporter:
            Jonathan Coveney
          • Votes:
            0
          • Watchers:
            4
