PIG-3251: Bzip2TextInputFormat requires double the memory of maximum record size

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels: None
    • With Hadoop 2.x or later, Pig uses Hadoop's bzip2 codec to handle bzip2 inputs. To turn this off, set pig.bzip.use.hadoop.inputformat=false.

    Description

      While looking at a user's OOM heap dump, I noticed that Pig's Bzip2TextInputFormat holds each record in memory twice: once in

      Bzip2TextInputFormat.buffer (a ByteArrayOutputStream)
      and once in the Text that is returned as the line.

      For example, for a single 160 MB record, the buffer was 268 MB and the Text was 160 MB.

      We can probably eliminate one of them.
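
      The effect described above can be reproduced outside Pig with a small, scaled-down sketch. It is not the Pig code: the class BufferFootprint, the helper MeasuredBuffer, and the method measure are hypothetical names, and a plain byte[] copy stands in for the Hadoop Text object. It assumes the reader appends bytes to a ByteArrayOutputStream until the end of the line and then copies them out, which is consistent with the heap dump but not confirmed against the actual source.

```java
import java.io.ByteArrayOutputStream;

public class BufferFootprint {

    /** Hypothetical helper: exposes the length of the protected backing array. */
    static class MeasuredBuffer extends ByteArrayOutputStream {
        int capacity() {
            return buf.length;
        }
    }

    /**
     * Accumulates one record of recordSize bytes, then copies it out.
     * Returns {backing array capacity, length of the copied record}.
     */
    static long[] measure(int recordSize) {
        MeasuredBuffer buffer = new MeasuredBuffer();
        for (int i = 0; i < recordSize; i++) {
            buffer.write('a'); // a line reader appends until it sees a newline
        }
        // Stand-in for filling the Text that is handed back as the line:
        // the bytes now exist in the buffer AND in the copy.
        byte[] line = buffer.toByteArray();
        return new long[] { buffer.capacity(), line.length };
    }

    public static void main(String[] args) {
        long[] r = measure(160_000); // scaled down from 160 MB
        System.out.println("buffer capacity: " + r[0] + " bytes");
        System.out.println("record copy:     " + r[1] + " bytes");
        System.out.println("held together:   " + (r[0] + r[1]) + " bytes");
    }
}
```

      The backing array is at least as large as the record, and common JDK implementations grow it by roughly doubling, which would be consistent with a 268 MB buffer (2^28 bytes) for a 160 MB record. Until the buffer is released or reset, both allocations are live at once.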

      Attachments

        1. pig-3251-trunk-v01.patch
          4 kB
          Koji Noguchi
        2. pig-3251-trunk-v02.patch
          5 kB
          Koji Noguchi
        3. pig-3251-trunk-v03.patch
          3 kB
          Koji Noguchi
        4. pig-3251-trunk-v04.patch
          6 kB
          Koji Noguchi
        5. pig-3251-trunk-v05.patch
          5 kB
          Koji Noguchi
        6. pig-3251-trunk-v06.patch
          20 kB
          Koji Noguchi
        7. pig-3251-trunk-v07.patch
          21 kB
          Koji Noguchi
        8. pig-3251-trunk-v08.patch
          21 kB
          Koji Noguchi
        9. pig-3251-trunk-v09.patch
          22 kB
          Koji Noguchi


          People

            Assignee: Koji Noguchi (knoguchi)
            Reporter: Koji Noguchi (knoguchi)
            Votes: 0
            Watchers: 5
