Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
With Hadoop 2.x or later, Pig uses Hadoop's bzip2 codec to handle bzip2 inputs. To turn this off, set pig.bzip.use.hadoop.inputformat=false.
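As a rough illustration of the opt-out, the sketch below builds a java.util.Properties object carrying that setting. Only the property name and value come from this issue; the class and method names are hypothetical, and how the properties are handed to Pig (for example through a PigServer constructor or a -D option on the command line) is an assumption to check against the Pig version in use.

```java
import java.util.Properties;

public class BzipInputFormatToggle {
    // Property name taken from the release note of this issue.
    static final String KEY = "pig.bzip.use.hadoop.inputformat";

    /** Returns properties that ask Pig to fall back to its own bzip reader. */
    static Properties withPigBzipReader() {
        Properties props = new Properties();
        // "false" disables the Hadoop bzip2 codec path described above.
        props.setProperty(KEY, "false");
        return props;
    }

    public static void main(String[] args) {
        // Assumption: these properties would then be passed to Pig by the caller.
        System.out.println(withPigBzipReader());
    }
}
```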
Description
While looking at a user's OOM heap dump, I noticed that Pig's Bzip2TextInputFormat holds memory in two places at once: Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) and the actual Text that is returned as the line.
For example, with a single 160 MB record, buffer was 268 MB and Text was 160 MB.
We can probably eliminate one of them.
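One plausible reading of these numbers: 268 MBytes is close to 2^28 bytes (268,435,456), which is what a backing array that grows by doubling would reach while holding a 160 MByte record. The sketch below is not Pig's code. It is a scaled-down illustration (160,000 bytes instead of 160 MBytes) with hypothetical class and method names, showing that a ByteArrayOutputStream can hold more capacity than the data written to it and that toByteArray() then allocates a second, separate copy. The exact growth policy is a JDK implementation detail, so the printed capacity may differ between runtimes.

```java
import java.io.ByteArrayOutputStream;

public class BufferCopyDemo {
    /** Exposes the length of the protected backing array. */
    static class ProbeStream extends ByteArrayOutputStream {
        int capacity() {
            return buf.length;
        }
    }

    /** Writes recordSize bytes one at a time, roughly as a line reader would. */
    static ProbeStream fill(int recordSize) {
        ProbeStream out = new ProbeStream();
        for (int i = 0; i < recordSize; i++) {
            out.write('a');
        }
        return out;
    }

    public static void main(String[] args) {
        ProbeStream out = fill(160_000);
        // toByteArray() copies the valid bytes into a new array, so the
        // record now exists twice: once in the buffer, once in the copy.
        byte[] line = out.toByteArray();
        System.out.println("record bytes:    " + out.size());
        System.out.println("buffer capacity: " + out.capacity());
        System.out.println("copied line:     " + line.length);
    }
}
```

If the record is only needed once, reusing the buffer's contents directly or dropping the intermediate buffer would avoid holding both copies, which is the direction this issue proposes.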
Attachments
Issue Links
- is duplicated by
  - PIG-4591 Drop use of the internal Bzip2TextInputFormat (Resolved)
- is related to
  - MAPREDUCE-5656 bzip2 codec can drop records when reading data in splits (Closed)
  - PIG-4533 Document error: Pig does support concatenated gz file (Closed)
- relates to
  - PIG-3352 Bzip2TextInputFormat can duplicate records across splits (Open)