Pig / PIG-5040

Order by and CROSS partitioning is not deterministic due to usage of Random

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0, 0.16.1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      Map tasks can be rerun after shuffle fetch failures. When that happens, half of the reducers can end up having successfully pulled their partitions from the first run of the map, while the other half pull from the rerun. If the Partitioner does not partition the data in exactly the same way on every run, the results can be incorrect: some records are lost and others are duplicated. This issue has existed with order by for 8 years and affects MapReduce as well, but it was found with Tez, where reruns caused by shuffle fetch failures are frequent (the order by partitioner gets its data from a 1-1 edge, so there are no retries and a shuffle fetch failure triggers a rerun immediately).
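
      One way to picture the requirement is to seed the random source from a value that is identical on every attempt of a map task. The sketch below is illustrative only and is not taken from the attached patch: the class, the method names, and the use of a task index as the seed are all assumptions. It also assumes that a rerun reads its records in the same order as the first attempt.

```java
import java.util.Arrays;
import java.util.Random;

public class DeterministicPartitionSketch {

    // Assigns each of recordCount records to one of reducerCount partitions,
    // drawing from the supplied Random in input order.
    public static int[] assign(Random rng, int recordCount, int reducerCount) {
        int[] partitions = new int[recordCount];
        for (int i = 0; i < recordCount; i++) {
            partitions[i] = rng.nextInt(reducerCount);
        }
        return partitions;
    }

    // Simulates one attempt of a map task. The seed is a hypothetical task
    // index that stays the same across attempts, so every attempt draws the
    // same sequence and therefore produces the same partition assignment.
    public static int[] seededAttempt(int taskIndex, int recordCount, int reducerCount) {
        return assign(new Random(taskIndex), recordCount, reducerCount);
    }

    public static void main(String[] args) {
        int[] firstRun = seededAttempt(7, 1000, 4);
        int[] rerun = seededAttempt(7, 1000, 4);
        System.out.println("seeded attempts match: " + Arrays.equals(firstRun, rerun));
    }
}
```

      With a fixed seed, both attempts produce the same assignment, so a reducer receives the same partition whichever attempt it fetches from. An unseeded new Random() gives no such guarantee, which is the failure mode described above.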

      Attachments

        1. PIG-5040-1.patch
          19 kB
          Rohini Palaniswamy
        2. PIG-5040-1-nowhitespacechanges.patch
          13 kB
          Rohini Palaniswamy

            People

              Assignee: Rohini Palaniswamy
              Reporter: Rohini Palaniswamy