Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 0.11, 0.12.0
- Fix Version/s: None
- Component/s: None
- Environment: RHEL 6/64-bit
- Hadoop Flags: Reviewed
Description
Three files are attached to help illustrate this issue:
1. mktestdata.py - a Python script that generates test data to feed the Pig script
2. test_cross.pig - the Pig script using CROSS and STORE
3. test_cross.out - the Pig console output showing the input/output record-count delta
To reproduce this Pig CROSS problem, use the supplied Python script, mktestdata.py, to generate an input file of at least 13,948,228,930 bytes (> 13 GB).
The CROSS between raw_data (m records) and cross_count (1 record) should yield exactly m records. Instead, the STORE following the CROSS produced only about 1/3 of the input records in raw_data.
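The attached test_cross.pig is not reproduced here; the following is only a minimal sketch of the pattern described above, with an assumed input path, schema, and output path.

raw_data = LOAD 'testdata' AS (field01:chararray, field04:chararray);  -- assumed path and schema
grouped = GROUP raw_data ALL;
cross_count = FOREACH grouped GENERATE COUNT(raw_data) AS n;  -- exactly one record
data = CROSS raw_data, cross_count;  -- expected: one output record per raw_data record
STORE data INTO 'test_cross_out';  -- observed: only about 1/3 of the input records stored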
If I combine both CROSS operations into a single statement, the STORE yields about 2/3 of the input records in raw_data:
data = CROSS raw_data, field04s_count, subsection1_field04s_count, subsection2_field04s_count;
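For reference, a self-contained version of the merged statement under the same assumptions as the sketch above; how the three count relations are actually derived in test_cross.pig is not shown here, so GROUP ... ALL / COUNT is only a stand-in, and the output path is hypothetical.

g = GROUP raw_data ALL;
field04s_count = FOREACH g GENERATE COUNT(raw_data) AS n1;              -- stand-in derivation
subsection1_field04s_count = FOREACH g GENERATE COUNT(raw_data) AS n2;  -- stand-in derivation
subsection2_field04s_count = FOREACH g GENERATE COUNT(raw_data) AS n3;  -- stand-in derivation
data = CROSS raw_data, field04s_count, subsection1_field04s_count, subsection2_field04s_count;
STORE data INTO 'test_cross_out_merged';  -- observed: about 2/3 of the input records stored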
We have reproduced this using both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop 2.x) clusters.
The default HDFS block size is 128 MB.