Pig / PIG-3251

Bzip2TextInputFormat requires double the memory of maximum record size

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      With Hadoop 2.x or later, Pig uses Hadoop's bzip2 codec to handle bzip2 inputs. To turn this off, set pig.bzip.use.hadoop.inputformat=false.

      Description

      While looking at a user's OOM heap dump, I noticed that Pig's Bzip2TextInputFormat holds each record in memory twice: once in

      Bzip2TextInputFormat.buffer (a ByteArrayOutputStream)
      and once in the Text object that is returned as the line.

      For example, for a single 160 MB record, buffer was 268 MB and Text was 160 MB.

      We can probably eliminate one of them.
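
      One plausible reading of those numbers, offered as an assumption and not something confirmed in this issue: a ByteArrayOutputStream that is filled in small appends typically grows its backing array by doubling, and doubling from the default 32 bytes until 160 MB fits lands on 2^28 = 268,435,456 bytes, which matches the 268 MB seen in the heap dump. The sketch below reproduces that arithmetic and shows, at a small scale, that toByteArray() returns a second copy while the oversized backing array stays alive. The class and method names (BufferGrowthSketch, InspectableBuffer, capacityAfterDoubling) are hypothetical, and the doubling policy is JDK implementation behavior, not a documented guarantee.

```java
import java.io.ByteArrayOutputStream;

class BufferGrowthSketch {

    /** Hypothetical helper: exposes the length of the protected backing array `buf`. */
    static class InspectableBuffer extends ByteArrayOutputStream {
        int capacity() {
            return buf.length;
        }
    }

    /**
     * Capacity reached by doubling from initialCapacity until `needed` bytes fit.
     * Assumes a pure doubling policy, which is what common JDKs do for small appends.
     */
    static long capacityAfterDoubling(long initialCapacity, long needed) {
        long capacity = initialCapacity;
        while (capacity < needed) {
            capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        // Arithmetic for the record size reported in the issue (160 MB).
        System.out.println("simulated capacity: " + capacityAfterDoubling(32, 160_000_000L));

        // Small-scale demonstration: 160 bytes appended one at a time.
        InspectableBuffer buffer = new InspectableBuffer();
        for (int i = 0; i < 160; i++) {
            buffer.write('x');
        }
        byte[] line = buffer.toByteArray(); // second copy, sized to the data
        System.out.println("backing array: " + buffer.capacity() + " bytes, copy: " + line.length + " bytes");
    }
}
```

      Under that assumption, peak usage per record is the rounded-up buffer plus the exact-size copy, which is why dropping either one would roughly halve the requirement.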

        Attachments

        1. pig-3251-trunk-v01.patch
          4 kB
          Koji Noguchi
        2. pig-3251-trunk-v02.patch
          5 kB
          Koji Noguchi
        3. pig-3251-trunk-v03.patch
          3 kB
          Koji Noguchi
        4. pig-3251-trunk-v04.patch
          6 kB
          Koji Noguchi
        5. pig-3251-trunk-v05.patch
          5 kB
          Koji Noguchi
        6. pig-3251-trunk-v06.patch
          20 kB
          Koji Noguchi
        7. pig-3251-trunk-v07.patch
          21 kB
          Koji Noguchi
        8. pig-3251-trunk-v08.patch
          21 kB
          Koji Noguchi
        9. pig-3251-trunk-v09.patch
          22 kB
          Koji Noguchi

              People

              • Assignee: Koji Noguchi (knoguchi)
              • Reporter: Koji Noguchi (knoguchi)
              • Votes: 0
              • Watchers: 5