Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
People often get confused and end up using . after a join instead of :: (PIG-2134) to reference fields of the join. Pig takes it to be a scalar and tries to read the whole join data assuming there is only one record and then fails with "scalar has more than one row in the output". ReadScalars uses InterStorage to read the scalar data. It uses InterInputFormat which extends FileInputFormat and for getSplits(), there is a listStatus on all the files in the dir to construct the splits and then InterRecordReader is used on the splits to read the data. When the data is really huge with lot of files, this can cause the namenode to go out of memory and crash and especially with the recent optimization change in hadoop 0.23/2.x that uses listLocatedStatus instead of listStatus + getBlockLocations in FileInputFormat to reduce number of calls to Namenode.
In the particular case we encountered, the join output had 6.5K files. On the job that did ReadScalars, it had 8K+ tasks (with speculative execution) and all of them doing listStatus for the 6.5K files caused NN queue to fill up with huge responses (block locations makes the response even bigger). It also saturated the network, causing responses to build up in NN without being sent finally leading it to crash with OOM.
Attachments
Issue Links
- is related to
-
PIG-2629 Wrong Usage of Scalar which is null causes high namenode operation
- Closed