  1. Pig
  2. PIG-3251

Bzip2TextInputFormat requires double the memory of maximum record size

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels: None
    • With Hadoop 2.x or later, Pig uses Hadoop's bzip2 codec to handle bzip2 inputs. (To turn this off, set pig.bzip.use.hadoop.inputformat=false.)

    Description

      While looking at a user's OOM heap dump, I noticed that Pig's Bzip2TextInputFormat holds each record in memory twice:

      Bzip2TextInputFormat.buffer (a ByteArrayOutputStream)
      and the actual Text that is returned as the line.

      For example, for a single 160 MB record, buffer was 268 MB and Text was 160 MB. (268 MB is consistent with ByteArrayOutputStream doubling its capacity as it grows: 2^28 bytes = 268,435,456 bytes.)

      We can probably eliminate one of them.
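
      To illustrate why the buffer can end up larger than the record itself, here is a minimal, scaled-down sketch in plain Java. It is not Pig's code: the class BufferGrowthSketch, the helper capacityAfterWriting, and the 32-byte chunk size are hypothetical names and values chosen for illustration, and a plain byte array stands in for the Text returned as the line. It only shows how a ByteArrayOutputStream's backing array grows while a record is accumulated.

```java
import java.io.ByteArrayOutputStream;

public class BufferGrowthSketch {

    /** Hypothetical subclass, used only to read the length of the protected backing array. */
    static class ExposedBuffer extends ByteArrayOutputStream {
        int capacity() {
            return buf.length;
        }
    }

    /** Writes recordBytes bytes in chunkBytes-sized pieces and returns the backing array length. */
    static int capacityAfterWriting(int recordBytes, int chunkBytes) {
        ExposedBuffer out = new ExposedBuffer();
        byte[] chunk = new byte[chunkBytes];
        int written = 0;
        while (written < recordBytes) {
            int n = Math.min(chunkBytes, recordBytes - written);
            out.write(chunk, 0, n);
            written += n;
        }
        return out.capacity();
    }

    public static void main(String[] args) {
        // Scaled-down stand-in for the 160 MB record (assumption: 160,000 bytes).
        int recordBytes = 160_000;
        int capacity = capacityAfterWriting(recordBytes, 32);

        // A second array of record size stands in for the Text returned as the line.
        byte[] lineCopy = new byte[recordBytes];

        System.out.println("record bytes:    " + recordBytes);
        System.out.println("buffer capacity: " + capacity);
        System.out.println("total held:      " + ((long) capacity + lineCopy.length));
    }
}
```

      With OpenJDK's doubling growth policy, the reported buffer capacity for this scaled-down record is 262,144 bytes (2^18), the same pattern as the 268 MB (2^28 bytes) buffer seen for the 160 MB record. Keeping only one of the two arrays alive at a time would remove most of that overhead.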

      Attachments

        1. pig-3251-trunk-v01.patch (4 kB, Koji Noguchi)
        2. pig-3251-trunk-v02.patch (5 kB, Koji Noguchi)
        3. pig-3251-trunk-v03.patch (3 kB, Koji Noguchi)
        4. pig-3251-trunk-v04.patch (6 kB, Koji Noguchi)
        5. pig-3251-trunk-v05.patch (5 kB, Koji Noguchi)
        6. pig-3251-trunk-v06.patch (20 kB, Koji Noguchi)
        7. pig-3251-trunk-v07.patch (21 kB, Koji Noguchi)
        8. pig-3251-trunk-v08.patch (21 kB, Koji Noguchi)
        9. pig-3251-trunk-v09.patch (22 kB, Koji Noguchi)

            People

              Assignee: Koji Noguchi (knoguchi)
              Reporter: Koji Noguchi (knoguchi)
              Votes: 0
              Watchers: 5
