Pig
  1. Pig
  2. PIG-1468

DataByteArray.compareTo() does not compare in lexicographic order

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      The compareTo() method of org.apache.pig.data.DataByteArray does not compare items in lexicographic order.
      Actually, it takes into account the signum of the bytes that compose the DataByteArray.

      So, for example, 0xff compares to less than 0x00

      1. PIG-1468.patch
        2 kB
        Gianmarco De Francisci Morales

        Activity

        Gianmarco De Francisci Morales created issue -
        Gianmarco De Francisci Morales made changes -
        Field Original Value New Value
        Attachment PIG-1468.patch [ 12448155 ]
        Gianmarco De Francisci Morales made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Gianmarco De Francisci Morales made changes -
        Assignee Gianmarco De Francisci Morales [ azaroth ]
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12448155/PIG-1468.patch
        against trunk revision 958053.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12448155/PIG-1468.patch against trunk revision 958053. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/354/console This message is automatically generated.
        Hide
        Gianmarco De Francisci Morales added a comment -

        The test failures seem related to hudson, I have no failure locally.
        I think the patch is ready for review.

        Show
        Gianmarco De Francisci Morales added a comment - The test failures seem related to hudson, I have no failure locally. I think the patch is ready for review.
        Hide
        Daniel Dai added a comment -

        The problem is we do not have unsigned byte in Java. Although DataByteArray is for semantic unknown type and the actual order does not matter, it is more natural to see 0xff > 0x00. I have two comments:
        1. Can we measure if there is any noticeable performance downgrade due to two additional "and" operation?
        2. Do we have somewhere else use this logic? It is important to keep them consistent.

        Show
        Daniel Dai added a comment - The problem is we do not have unsigned byte in Java. Although DataByteArray is for semantic unknown type and the actual order does not matter, it is more natural to see 0xff > 0x00. I have two comments: 1. Can we measure if there is any noticeable performance downgrade due to two additional "and" operation? 2. Do we have somewhere else use this logic? It is important to keep them consistent.
        Hide
        Gianmarco De Francisci Morales added a comment -

        1) I will write a simple program to measure the performance impact.

        2) I think this has no correlation to other places, but I will check.
        Furthermore, this patch makes the ordering consistent with Hadoop's WritableComparator.compareBytes() (lexicographic order of binary data).

        Show
        Gianmarco De Francisci Morales added a comment - 1) I will write a simple program to measure the performance impact. 2) I think this has no correlation to other places, but I will check. Furthermore, this patch makes the ordering consistent with Hadoop's WritableComparator.compareBytes() (lexicographic order of binary data).
        Hide
        Gianmarco De Francisci Morales added a comment -

        I ran some tests. I see a ~1% decrease in performance overall.

        I looked around the codebase for references to the method, and it does not seem there is any place that relies on the specific ordering.

        Here is the code I used:

        import java.util.Random;
        
        public class TestSpeed {
            private static final int TIMES = (int) 10e6;
            private static final int NUM_ARRAYS = (int) 10e5;
            private static final int ARRAY_LENGTH = 50;
        
            private static int compareSigned(byte[] b1, byte[] b2) {
                int i;
                for (i = 0; i < b1.length; i++) {
                    if (i >= b2.length)
                        return 1;
                    int a = b1[i];
                    int b = b2[i];
                    if (a < b)
                        return -1;
                    else if (a > b)
                        return 1;
                }
                if (i < b2.length)
                    return -1;
                return 0;
            }
        
            private static int compareUnsisgned(byte[] b1, byte[] b2) {
                int i;
                for (i = 0; i < b1.length; i++) {
                    if (i >= b2.length)
                        return 1;
                    int a = b1[i] & 0xff;
                    int b = b2[i] & 0xff;
                    if (a < b)
                        return -1;
                    else if (a > b)
                        return 1;
                }
                if (i < b2.length)
                    return -1;
                return 0;
            }
        
            public static void main(String[] args) {
                long before, after;
                Random rand = new Random(123456789);
                byte[][] batch1 = new byte[NUM_ARRAYS][];
                byte[][] batch2 = new byte[NUM_ARRAYS][];
                for (int i = 0; i < NUM_ARRAYS; i++) {
                    batch1[i] = new byte[ARRAY_LENGTH];
                    batch2[i] = new byte[ARRAY_LENGTH];
                    rand.nextBytes(batch1[i]);
                    rand.nextBytes(batch2[i]);
                }
        
                before = System.currentTimeMillis();
                for (int i = 0; i < TIMES; i++)
                    for (int j = 0; j < ARRAY_LENGTH; j++)
                        compareSigned(batch1[j], batch2[j]);
                after = System.currentTimeMillis();
                System.out.println("Time for signed comparison (ms): " + (after - before));
        
                before = System.currentTimeMillis();
                for (int i = 0; i < TIMES; i++)
                    for (int j = 0; j < ARRAY_LENGTH; j++)
                        compareUnsisgned(batch1[j], batch2[j]);
                after = System.currentTimeMillis();
                System.out.println("Time for UNsigned comparison (ms): " + (after - before));
            }
        }
        
        Show
        Gianmarco De Francisci Morales added a comment - I ran some tests. I see a ~1% decrease in performance overall. I looked around the codebase for references to the method, and it does not seem there is any place that relies on the specific ordering. Here is the code I used: import java.util.Random; public class TestSpeed { private static final int TIMES = ( int ) 10e6; private static final int NUM_ARRAYS = ( int ) 10e5; private static final int ARRAY_LENGTH = 50; private static int compareSigned( byte [] b1, byte [] b2) { int i; for (i = 0; i < b1.length; i++) { if (i >= b2.length) return 1; int a = b1[i]; int b = b2[i]; if (a < b) return -1; else if (a > b) return 1; } if (i < b2.length) return -1; return 0; } private static int compareUnsisgned( byte [] b1, byte [] b2) { int i; for (i = 0; i < b1.length; i++) { if (i >= b2.length) return 1; int a = b1[i] & 0xff; int b = b2[i] & 0xff; if (a < b) return -1; else if (a > b) return 1; } if (i < b2.length) return -1; return 0; } public static void main( String [] args) { long before, after; Random rand = new Random(123456789); byte [][] batch1 = new byte [NUM_ARRAYS][]; byte [][] batch2 = new byte [NUM_ARRAYS][]; for ( int i = 0; i < NUM_ARRAYS; i++) { batch1[i] = new byte [ARRAY_LENGTH]; batch2[i] = new byte [ARRAY_LENGTH]; rand.nextBytes(batch1[i]); rand.nextBytes(batch2[i]); } before = System .currentTimeMillis(); for ( int i = 0; i < TIMES; i++) for ( int j = 0; j < ARRAY_LENGTH; j++) compareSigned(batch1[j], batch2[j]); after = System .currentTimeMillis(); System .out.println( "Time for signed comparison (ms): " + (after - before)); before = System .currentTimeMillis(); for ( int i = 0; i < TIMES; i++) for ( int j = 0; j < ARRAY_LENGTH; j++) compareUnsisgned(batch1[j], batch2[j]); after = System .currentTimeMillis(); System .out.println( "Time for UNsigned comparison (ms): " + (after - before)); } }
        Hide
        Daniel Dai added a comment -

        The other concern is we only change DataByteArray not byte. So comparator for DataType.BYTEARRAY and DataType.BYTE is different. This will cause confusion.

        Show
        Daniel Dai added a comment - The other concern is we only change DataByteArray not byte. So comparator for DataType.BYTEARRAY and DataType.BYTE is different. This will cause confusion.
        Hide
        Gianmarco De Francisci Morales added a comment -

        It is quite easy to fix DataType.compare() to keep into account the unsigned logic.
        But I am starting to feel that all of this is probably not worth the trouble.
        This would make DataType.compare() for Bytes different from Byte.compareTo().

        Show
        Gianmarco De Francisci Morales added a comment - It is quite easy to fix DataType.compare() to keep into account the unsigned logic. But I am starting to feel that all of this is probably not worth the trouble. This would make DataType.compare() for Bytes different from Byte.compareTo().
        Hide
        Gianmarco De Francisci Morales added a comment -

        We want to keep the comparators consistent.

        Show
        Gianmarco De Francisci Morales added a comment - We want to keep the comparators consistent.
        Gianmarco De Francisci Morales made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]

          People

          • Assignee:
            Gianmarco De Francisci Morales
            Reporter:
            Gianmarco De Francisci Morales
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development