Pig / PIG-4175

PIG CROSS operation followed by STORE produces non-deterministic results each run


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11, 0.12.0
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels: None
    • Environment: RHEL 6/64-bit
    • Hadoop Flags: Reviewed

    Description

      Three files will be attached to help visualize this issue.

      1. mktestdata.py - generates the test data that feeds the Pig script
      2. test_cross.pig - the Pig script using CROSS and STORE
      3. test_cross.out - the Pig console output showing the delta between input and output record counts

      To reproduce this PIG CROSS operation problem, you need to use the supplied Python script,
      mktestdata.py, to generate an input file that is at least 13,948,228,930 bytes (> 13GB).

      The CROSS between raw_data (m records) and cross_count (1 record) should yield exactly m records as output.
      Instead, the STORE after the CROSS operations wrote only about 1/3 of the input records in raw_data.

      If I combine both CROSS operations into a single statement, the STORE writes about 2/3
      of the input records in raw_data:
      -- data = CROSS raw_data, field04s_count, subsection1_field04s_count, subsection2_field04s_count;
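
The expected cardinality described above can be checked on a toy scale. The following is a minimal Python sketch, not Pig code: the `cross` and `expected_cross_count` helpers and the three-row `raw_data` stand-in are hypothetical and only illustrate that crossing m records with single-record relations must produce exactly m records.

```python
from itertools import product
from math import prod


def cross(*relations):
    """Cartesian product of the relations, with each result row flattened
    into one tuple (roughly what CROSS produces for small in-memory data)."""
    return [sum(rows, ()) for rows in product(*relations)]


def expected_cross_count(*relations):
    """The number of rows a cross product must contain: the product of the
    input sizes."""
    return prod(len(r) for r in relations)


# Hypothetical stand-ins: raw_data has m = 3 records, cross_count has 1.
raw_data = [("a", 1), ("b", 2), ("c", 3)]
cross_count = [(3,)]

data = cross(raw_data, cross_count)

# m x 1 = m: every input record must appear exactly once in the output.
assert len(data) == expected_cross_count(raw_data, cross_count) == len(raw_data)
```

Comparing the stored record count against this product after each run is one way to quantify the loss reported here (about 1/3 or 2/3 of m instead of m).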

      We have reproduced this using both Pig 0.11 (Hadoop 1.x) and Pig 0.12 (Hadoop 2.x) clusters.
      The default HDFS block size is 128MB.
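
For a sense of scale, the input size and block size quoted above imply the following number of HDFS blocks. This is a small arithmetic sketch in Python; it assumes "128MB" means 128 × 1024 × 1024 bytes, and it counts blocks only. It says nothing about how many map tasks Pig actually schedules, and it is not a claim about the cause of the bug.

```python
INPUT_BYTES = 13_948_228_930      # minimum input size quoted in the report
BLOCK_BYTES = 128 * 1024 * 1024   # assumed meaning of "128MB"


def block_count(file_bytes, block_bytes):
    """Number of blocks needed to hold file_bytes (ceiling division)."""
    return -(-file_bytes // block_bytes)


blocks = block_count(INPUT_BYTES, BLOCK_BYTES)
print(blocks)  # 104
```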

      Attachments

        1. PIG-4175-additional-1.patch
          10 kB
          Rohini Palaniswamy
        2. PIG-4175-Debug.patch
          3 kB
          Rohini Palaniswamy
        3. PIG-4175-1.patch
          8 kB
          Daniel Dai
        4. pig_testcross_plan.png
          194 kB
          Jim Huang
        5. test_cross.out
          2 kB
          Jim Huang
        6. test_cross.pig
          3 kB
          Jim Huang
        7. mktestdata.py
          0.6 kB
          Jim Huang

          People

            Assignee: Daniel Dai (daijy)
            Reporter: Jim Huang (jimhuang)
            Votes: 0
            Watchers: 3
