[PIG-2629] Wrong Usage of Scalar which is null causes high namenode operation - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.8.1, 0.9.2, 0.10.0
Fix Version/s: 0.12.1, 0.13.0
Component/s: None
Labels:
None

Hadoop Flags:

Reviewed

Description

Hi,

Script

A = LOAD 'test3.txt'   AS (from:chararray);
B = LOAD 'test2.txt'    AS (source:chararray,to:chararray);
C = FILTER A BY (from == 'temp' );
D = FILTER B BY (source MATCHES '.*xyz*.');
E = JOIN C by (from) left outer,D by (to);
F = FILTER E BY (D.to IS NULL);
dump F;

Inputs

$ cat test2.txt
temp    temp
temp    temp
temp    temp
temp    temp
temp    temp
tepm    tepm

$ cat test3.txt  |head
temp
temp
temp
temp
temp
temp
tepm
temp
temp
temp

Here I have by mistake called 'to' using 'D.to' instead of 'D::to'. The D relation gives null output.

First Map Reduce job computes D which give null results.
The MapPlan of 2nd job

Union[tuple] - scope-56
|
|---E: Local Rearrange[tuple]{chararray}(false) - scope-36
|   |   |
|   |   Project[chararray][0] - scope-37
|   |
|   |---C: Filter[bag] - scope-26
|       |   |
|       |   Equal To[boolean] - scope-29
|       |   |
|       |   |---Project[chararray][0] - scope-27
|       |   |
|       |   |---Constant(temp) - scope-28
|       |
|       |---A: New For Each(false)[bag] - scope-25
|           |   |
|           |   Cast[chararray] - scope-23
|           |   |
|           |   |---Project[bytearray][0] - scope-22
|           |
|           |---F: Filter[bag] - scope-17
|               |   |
|               |   POIsNull[boolean] - scope-21
|               |   |
|               |   |---POUserFunc(org.apache.pig.impl.builtin.ReadScalars)[chararray] - scope-20
|               |       |
|               |       |---Constant(1) - scope-18
|               |       |
|               |       |---Constant(hdfs://nn-nn1/tmp/temp-1607149525/tmp281350188) - scope-19
|               |
|               |---A: Load(hdfs://nn-nn1/user/anithar/test3.txt:org.apache.pig.builtin.PigStorage) - scope-0
|
|---E: Local Rearrange[tuple]{chararray}(false) - scope-38
    |   |
    |   Project[chararray][1] - scope-39
    |
    |---Load(hdfs://nn-nn1/tmp/temp-1607149525/tmp-458164144:org.apache.pig.impl.io.TFileStorage) - scope-53--------

Here at F , the file /tmp/temp-1607149525/tmp281350188 which is the output of the 1st Mapreduce Job is repeatedly read.
If the input to F was non empty, since I am calling the scalar wrongly, it would have failed with the expected error message 'Scalar has more than 1 row in the output'.

But since its null, it returns in ReadScalars before the exception is thrown and gives these in the task logs repeatedly

2012-04-03 11:46:58,824 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input paths to process : 1
2012-04-03 11:46:58,824 INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil: Total input paths to process : 1
2012-04-03 11:46:58,827 WARN org.apache.pig.impl.builtin.ReadScalars: No scalar field to read, returning null
....
....

That is its reading the '/tmp/temp-1607149525/tmp281350188' file again and again which was causing high namenode operation.
The cost of one small mistake had ended up causing heavy namenode operations.

Regards,
Anitha

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

PIG-2629-1.patch
23/Oct/13 17:49
2 kB
Rohini Palaniswamy
PIG-2629-2.patch
06/Dec/13 00:30
4 kB
Rohini Palaniswamy

Issue Links

relates to

PIG-3669 Wrong Usage of Scalar can bring down Namenode

Open

Activity

People

Assignee:: Rohini Palaniswamy

Reporter:: Anitha Raju

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 03/Apr/12 14:52

Updated:: 15/Apr/14 20:44

Resolved:: 06/Dec/13 00:45