  1. Pig
  2. PIG-4821

Pig chararray field with special UTF-8 chars as part of tuple join key produces wrong results in Tez

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.16.0, 0.15.1
    • None
    • None
    • Reviewed

    Description

      SedesHelper.writeChararray writes the string with writeUTF, but BinInterSedesTupleRawComparator reads it back with str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8); see https://github.com/apache/pig/blob/e0c5f265c68491395d8303c86195445be3d8aecf/src/org/apache/pig/data/BinInterSedes.java#L959-L964. For some reason this works fine on my Mac (with both JDK 7 and JDK 8) but not on Linux. I am not sure of the actual cause and have not dug into it. I suspect either the charset environment or the specific JDK 8 update, which differs between my Mac and the Linux machines.
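
      One documented difference that could be relevant here: DataOutput.writeUTF emits Java's "modified UTF-8", which encodes a supplementary character as two 3-byte surrogate sequences rather than one 4-byte sequence, so decoding those bytes as standard UTF-8 does not give back the original string. The sketch below only illustrates that mismatch with plain JDK classes. The class name, helper names, and sample keys are made up for illustration, and it is not taken from the Pig code or the attached patch, nor is it a confirmed root cause.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Pig.
class ModifiedUtf8Mismatch {

    // Serialize the way writeUTF does: 2-byte length prefix + modified UTF-8 payload.
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                out.writeUTF(s);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Matching reader: readUTF understands the modified encoding.
    static String readUtf(byte[] written) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(written))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mismatched reader: skip the 2-byte length prefix and decode the
    // payload as standard UTF-8, similar in spirit to new String(bytes, off, len, "UTF-8").
    static String decodeAsStandardUtf8(byte[] written) {
        return new String(written, 2, written.length - 2, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two distinct keys that differ only in a supplementary character.
        String a = "key\uD83D\uDE00"; // U+1F600
        String b = "key\uD83D\uDE01"; // U+1F601

        byte[] wa = writeUtf(a);
        byte[] wb = writeUtf(b);

        System.out.println("writeUTF payload length:  " + (wa.length - 2));
        System.out.println("standard UTF-8 length:    " + a.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("readUTF round-trips:      " + readUtf(wa).equals(a));
        System.out.println("UTF-8 decode round-trips: " + decodeAsStandardUtf8(wa).equals(a));
        // The surrogate byte sequences are malformed in standard UTF-8, so the
        // decoder substitutes replacement characters and distinct keys can collide.
        System.out.println("distinct keys decode equal: "
                + decodeAsStandardUtf8(wa).equals(decodeAsStandardUtf8(wb)));
    }
}
```

      For ASCII-only strings the two encodings produce identical bytes, which would be consistent with the problem showing up only for keys that contain special characters.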

      Attachments

        1. PIG-4821-1.patch
          2 kB
          Rohini Palaniswamy

        Issue Links

          Activity

            rohini Rohini Palaniswamy added a comment -

            Attaching the patch. The fix itself is simple, but I have not been able to come up with a test case for it, which is why I postponed uploading the patch for so long. Two issues have made the test case difficult:

            • The issue is not easily reproducible on my laptop. I tried varying the encoding (sun.io.unicode.encoding), the JDK, and different subsets of the data, and none of that reproduced it reliably. The issue does reproduce sometimes, but rarely and not repeatably.
            • Narrowing down to a smaller dataset for the test. The issue occurs during sorting, and only when the data goes through comparison in a specific order. I have not been able to isolate a minimal set of records from the larger data.

            This patch is important and needs to go into the release. I will create a separate JIRA to add a test case later.
            daijy Daniel Dai added a comment -

            +1


            rohini Rohini Palaniswamy added a comment -

            Committed to branch-0.15, branch-0.16 and trunk. Thanks for the review, Daniel.

            People

              rohini Rohini Palaniswamy
              rohini Rohini Palaniswamy