PIG-1295

Binary comparator for secondary sort

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0
    • Fix Version/s: 0.8.0
    • Component/s: impl
    • Hadoop Flags: Reviewed

      Description

      When the Hadoop framework does the sorting, it will try to use a binary version of the comparator if one is available. The benefit of a binary comparator is that we do not need to instantiate the objects before we compare them. We saw a ~30% speedup after switching to binary comparators. Currently, Pig uses a binary comparator in the following cases:

      1. When the semantics of the order don't matter. For example, in distinct, we need to sort in order to filter out duplicate values; however, we do not care how the comparator orders the keys. Group-by shares this characteristic. In these cases, we rely on Hadoop's default binary comparator.
      2. When the semantics of the order matter, but the key is of a simple type. In this case, we have binary comparator implementations for the simple types: integer, long, float, chararray, databytearray, string.

      However, if the key is a tuple and the sort semantics matter, we do not have a binary comparator implementation. This especially matters once we switch to using secondary sort. In secondary sort, we convert the inner sort of a nested foreach into the secondary key and rely on Hadoop to sort on both the main key and the secondary key. The sort key then becomes a two-item tuple. Since the secondary key is the sort key of the nested foreach, the sort semantics matter. As a result, we have no binary comparator once we use secondary sort, and we see a significant slowdown.

      A binary comparator for tuples should be doable once we understand the binary structure of the serialized tuple. We can focus on the most common use case first, which is a "group by" followed by a nested sort. In this case, we will use secondary sort. The semantics of the first key do not matter, but the semantics of the secondary key do. We need to identify the boundary between the main key and the secondary key in the binary tuple buffer without instantiating the tuple itself. Then, if the main keys are equal, we use a binary comparator to compare the secondary keys. The secondary key can also be of a complex data type, but as a first step we focus on simple secondary keys, which are the most common use case.
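      To make the shape of this concrete, below is a minimal sketch of a Hadoop-style raw comparator over a hypothetical fixed-width layout of (int main key, int secondary key). It is illustrative only and does not follow Pig's actual serialization format.

          import java.nio.ByteBuffer;

          // Illustrative sketch only: assumes each serialized sort key is simply an
          // int main key followed by an int secondary key, not Pig's real wire format.
          public class TwoFieldRawComparatorSketch {
              public static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                  ByteBuffer buf1 = ByteBuffer.wrap(b1, s1, l1);
                  ByteBuffer buf2 = ByteBuffer.wrap(b2, s2, l2);
                  // Compare the main keys without deserializing a Tuple object.
                  int rc = Integer.compare(buf1.getInt(), buf2.getInt());
                  // Only when the main keys tie do we touch the secondary keys.
                  return rc != 0 ? rc : Integer.compare(buf1.getInt(), buf2.getInt());
              }
          }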

      We mark this issue as a candidate project for the "Google Summer of Code 2010" program.

      Attachments

      1. PIG-1295_0.16.patch
        167 kB
        Daniel Dai
      2. PIG-1295_0.15.patch
        168 kB
        Daniel Dai
      3. PIG-1295_0.14.patch
        167 kB
        Daniel Dai
      4. PIG-1295_0.13.patch
        152 kB
        Gianmarco De Francisci Morales
      5. PIG-1295_0.12.patch
        131 kB
        Gianmarco De Francisci Morales
      6. PIG-1295_0.11.patch
        86 kB
        Gianmarco De Francisci Morales
      7. PIG-1295_0.10.patch
        59 kB
        Gianmarco De Francisci Morales
      8. PIG-1295_0.9.patch
        55 kB
        Gianmarco De Francisci Morales
      9. PIG-1295_0.8.patch
        36 kB
        Gianmarco De Francisci Morales
      10. PIG-1295_0.7.patch
        33 kB
        Gianmarco De Francisci Morales
      11. PIG-1295_0.6.patch
        31 kB
        Gianmarco De Francisci Morales
      12. PIG-1295_0.5.patch
        21 kB
        Gianmarco De Francisci Morales
      13. PIG-1295_0.4.patch
        14 kB
        Gianmarco De Francisci Morales
      14. PIG-1295_0.3.patch
        13 kB
        Gianmarco De Francisci Morales
      15. PIG-1295_0.2.patch
        12 kB
        Gianmarco De Francisci Morales
      16. PIG-1295_0.1.patch
        10 kB
        Gianmarco De Francisci Morales


          Activity

          Gianmarco De Francisci Morales added a comment -

          Avro provides efficient binary comparison for tuples with generic schemas.
          A simple way to implement this would be to adapt Pig to use Avro for intermediate files.
           This would allow using generic schemas for keys and solve the general problem in an efficient and elegant way.

          I would be glad to give it a try.

          Daniel Dai added a comment -

           Hi Gianmarco,
           Thank you for your interest. Avro is one thing we definitely want to try. We have some previous work on Avro (PIG-794), but we haven't been actively working on it for a while. The previous implementation might not be quite right, it is based on a very old version of Avro, and in the end we did not include it in the Pig codebase. We would be very happy if you want to try it again, and we would recommend you build it from scratch. We are very interested to see the performance and additional gains provided by Avro.

          Gianmarco De Francisci Morales added a comment -

          I will start to look at the documentation and source code of both Pig and Avro to get a rough idea of the work to do.
          Maybe having a look at the old patch would be good to identify the points in the source code where changes are needed?

           Where can I discuss this issue and ideas to solve it? The mailing list? IRC?
           Do you have any other suggestions?

          Dmitriy V. Ryaboy added a comment -

           Gianmarco, the ticket for the Avro loader (PIG-794) is where you would discuss this. The old patch can be more or less discarded – the whole loader interface changed recently to make this sort of thing much easier and more powerful – but it's the issue that folks interested in this are tracking.

          Daniel Dai added a comment -

           Hi Gianmarco,
           The mailing list and Jira should be good places to discuss your ideas. Dmitriy is right, we should discuss Avro issues in PIG-794. If you are talking about "Google Summer of Code", Avro is definitely a candidate project for you to work on.

          Gianmarco De Francisci Morales added a comment -

          Hi,
          I have been reading the source code and the referenced PIG-1038 issue.

           Avro integration is probably too big a project for GSoC, but implementing the tuple binary comparator seems doable.
           I will write a proposal; any advice for it?

          My idea of the project's breakdown would be like this:

          Identify the cases that can be optimized and the appropriate visitor for those.
          Write a test unit for this optimization.
          Implement the comparator knowing the data types of the tuple.
          Write a second test unit with different types.
           Write the logic to extract tuple boundaries from schema information (I suppose this optimization is possible only if the schema is known).
          Try to extend it to the general case of complex data type as secondary key.

          Thoughts?

          Daniel Dai added a comment -

           Thanks Gianmarco,
           My suggestion is to divide it into two steps:
           1. Make the binary comparator work.
           2. Integrate it into the current Pig code.

           It is better to make sure we have a quality deliverable for step 1 before we move to step 2.

          Gianmarco De Francisci Morales added a comment -

          I have drafted my proposal at
          http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/azaroth/t127030843242

          Any feedback is more than welcome.

          Daniel Dai added a comment -

          Thanks Gianmarco,
           Proposal looks good. Besides unit tests, we need to add some performance tests in both phase 1 and phase 2.

          Gianmarco De Francisci Morales added a comment -

           Here is my first draft of the binary comparator, together with some simple tests.

          This comparator should replace PigTupleRawComparator.

          I assume that the tuple we are comparing is a NullableTuple that wraps a DefaultTuple.
          There is a comment in the old comparator that says:

          // Users are allowed to implement their own versions of tuples, which means we have no
          // idea what the underlying representation is.
          

           This will need to be addressed. If I can tell that the tuple is not a DefaultTuple by looking at its data type byte, I can switch back to object deserialization. This means that a custom tuple must not use the same DataType.TUPLE type byte as the default tuple.
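           A rough sketch of that guard, under this comment's layout assumption (a null-indicator byte followed by the tuple's type byte; compareDeserialized() and compareRaw() are hypothetical helpers standing in for the methods in the patch):

               import org.apache.pig.data.DataType;

               // Sketch of the fallback guard described above. Layout assumption (from
               // this comment, not verified against the wire format): a null-indicator
               // byte, then the tuple's data type byte.
               int compareWithFallback(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                   if (b1[s1 + 1] != DataType.TUPLE || b2[s2 + 1] != DataType.TUPLE) {
                       // Unknown tuple implementation: deserialize and compare as objects.
                       return compareDeserialized(b1, s1, l1, b2, s2, l2);  // hypothetical
                   }
                   return compareRaw(b1, s1, l1, b2, s2, l2);               // hypothetical
               }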

           The comparator iterates over the serialized tuples and compares them following the same logic as the old comparator (first check for null, then compare their sizes, then compare field by field; if the fields are of different types compare the data types, else compare the values).
           For now I have implemented the logic only for simple types; I plan to add bytearrays and chararrays. I do not plan to add support for complex data types; in that case I will fall back to object deserialization.

           The implementation uses a ByteBuffer to easily iterate over the values. I don't know whether it is acceptable or what its performance impact is, but it is very handy. The alternative is manual bookkeeping of a cursor inside the byte arrays.
           To do this I would probably use a map that tells me how many bytes each data type uses. This map should probably go somewhere else, though (DataType? DataReaderWriter?). I commented out an initial attempt at this at the end of the class.
           An alternative implementation could follow the strategy of the other PigRawComparators, which wrap a comparator from Hadoop. This would require keeping a number of comparators in memory, though. I commented out this approach at the beginning of the class.
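           For illustration, the field-by-field loop described above might look roughly like this, over a made-up two-type wire format (the type constants here are illustrative, not Pig's real DataType values):

               import java.nio.ByteBuffer;

               // Sketch of the field-by-field comparison loop described above, over a
               // made-up serialization where each field is a type byte plus its payload.
               final class FieldLoopSketch {
                   static final byte INT = 1, LONG = 2;  // illustrative type bytes

                   static int compareFields(ByteBuffer bb1, ByteBuffer bb2, int fieldCount) {
                       int rc = 0;
                       for (int i = 0; i < fieldCount && rc == 0; i++) {
                           byte t1 = bb1.get(), t2 = bb2.get();
                           if (t1 != t2) return t1 - t2;  // different types: order by type byte
                           switch (t1) {
                               case INT:  rc = Integer.compare(bb1.getInt(), bb2.getInt());  break;
                               case LONG: rc = Long.compare(bb1.getLong(), bb2.getLong());   break;
                               default: throw new IllegalStateException("fall back to deserialization");
                           }
                       }
                       return rc;
                   }
               }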

          TODO:
          Implement fallback behaviour.
          Implement support for bytearrays and chararrays.
          Graceful handling of tuples different from DefaultTuple.
          Add performance tests.

          Any comment is more than welcome.

          Daniel Dai added a comment -

           I briefly reviewed the patch; it looks good. This is the approach we expected. Can we do some initial performance tests first?

          Gianmarco De Francisci Morales added a comment -

           I added some simple performance tests.
           The tests generate 1 million tuples by modifying a prototypical tuple and compare them to the prototype.
           One test uses the new comparator and the other uses the old one. I generate exactly the same tuples using a fixed seed, and I check the correctness of the comparisons against the tuples' normal compareTo() method (sketched below).

           The logic to generate the tuples is a bit involved: I tried to exercise all the data type comparisons in a uniform manner, so I mutate the first elements of the tuple less, to raise the probability that the comparison proceeds further down the tuple. The probabilities are totally made up and do not make much sense.

           As a first approximation, I see a slight overall speedup in the test.
           I will do some profiling to see what margin of improvement we have.
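           The cross-check mentioned above amounts to asserting sign agreement between the raw comparison and the object-level compareTo(); a minimal sketch, where mutate() and rawCompare() are hypothetical stand-ins for the helpers in the patch:

               import java.util.Random;
               import org.apache.pig.data.Tuple;

               final class CrossCheckSketch {
                   // For every generated pair, the raw comparator must agree in sign with
                   // the object-level compareTo(). A fixed seed keeps runs reproducible.
                   static void crossCheck(Tuple prototype) {
                       Random rand = new Random(42);
                       for (int i = 0; i < 1000000; i++) {
                           Tuple mutated = mutate(prototype, rand);  // hypothetical helper
                           int rawSign = Integer.signum(rawCompare(mutated, prototype));  // hypothetical
                           assert rawSign == Integer.signum(mutated.compareTo(prototype));
                       }
                   }
               }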

          Gianmarco De Francisci Morales added a comment -

          I did some more detailed profiling and testing. I ran a slightly modified version of the performance tests in the patch.
          On my machine, the new comparator takes between 8 and 10 seconds. The old one between 11 and 14.
           The performance advantage is not so big, but I also found that most of the time in the test is not spent comparing.
          The profiler shows that most of the time is spent in cloning (probably JUnit internals?):

              [junit]  84.0%     0  +  1769    java.lang.Object.clone
          

           As far as user code is concerned, most of the time is spent writing data. Randomization of the input also takes some time.
           The old comparator (0.9% + 0.3% for compareTuple + compare) takes more than 1% of the time. The new one takes only 0.2% of the time.

              [junit]   2.9%    62  +     0    org.apache.pig.data.DataReaderWriter.writeDatum
              [junit]   1.1%    23  +     0    java.io.DataOutputStream.writeInt
              [junit]   0.9%    19  +     0    java.util.Random.next
              [junit]   0.9%    18  +     0    java.io.DataOutputStream.writeLong
              [junit]   0.9%    18  +     0    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparator.compareTuple
              [junit]   0.6%    11  +     1    org.apache.pig.data.DataReaderWriter.readDatum
              [junit]   0.5%    11  +     0    java.io.DataInputStream.readInt
              [junit]   0.5%    11  +     0    org.apache.pig.data.DefaultTuple.readFields
              [junit]   0.3%     7  +     0    org.apache.pig.impl.io.PigNullableWritable.write
              [junit]   0.3%     7  +     0    org.apache.pig.data.DefaultTuple.<init>
              [junit]   0.3%     3  +     3    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparator.compare
              [junit]   0.2%     5  +     0    java.util.ArrayList.get
              [junit]   0.2%     5  +     0    org.apache.pig.data.DefaultTuple.write
              [junit]   0.2%     3  +     2    org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleRawComparatorNew.compare
          

           All in all, these tests are probably not very representative, but I also feel that the speedup we may get from a raw comparator is somewhat limited.
           I am open to suggestions on how to modify the performance tests to make them more representative.

          Daniel Dai added a comment -

          "the new comparator takes between 8 and 10 seconds. The old one between 11 and 14", that is a 137% or 140% speedup. That's significant enough for us to go forward.

          Daniel Dai added a comment -

          Can you also attach performance test code? I want to take a look. Thanks.

          Gianmarco De Francisci Morales added a comment -

          Sure,
           here is the revised file.
           I left the random testing in for unit-testing purposes, and I moved the performance testing to a separate main method.
          This makes profiling much easier.

           FYI, for profiling I am using a mixture of hprof (the built-in Java profiler) and JRat (http://jrat.sourceforge.net).

          I can prepare a more detailed report of where most of the time is consumed, if needed.
           Roughly 80% of the test time is spent in the org.apache.pig.data package writing tuples (DefaultTuple.write(), DataReaderWriter.writeDatum(), and DataType.findType() are the most expensive methods).
          From targeted profiling I have seen that the raw version of the compare method is around 3x faster than the old one.

          Daniel Dai added a comment -

           I ran your main method. It also counts the tuple byte generation time, which dominates the CPU profile. I ran it another way: generate the bytes first and exclude that from the timing, then compare the bytes using the different comparators. Here is my code snippet:

           byte[][] toCompare1 = new byte[TIMES][];
           byte[][] toCompare2 = new byte[TIMES][];
           NullableTuple t;
           // Serialize the random tuples up front so the timed loops below cover
           // only the compare() calls, not tuple generation and writing.
           for (int i = 0; i < TIMES; i++) {
               t = new NullableTuple(test.getRandomTuple(rand));
               t.write(test.dos1);
               toCompare1[i] = test.baos1.toByteArray();
           }
           for (int i = 0; i < TIMES; i++) {
               t = new NullableTuple(test.getRandomTuple(rand));
               t.write(test.dos2);
               toCompare2[i] = test.baos2.toByteArray();
           }

           // Time the new raw comparator.
           before = System.currentTimeMillis();
           for (int loop = 0; loop < 10000; loop++) {
               for (int i = 0; i < TIMES; i++) {
                   test.comparator.compare(toCompare1[i], 0, toCompare1[i].length, toCompare2[i], 0, toCompare2[i].length);
               }
           }
           after = System.currentTimeMillis();
           System.out.println("new comparator: " + (after - before) + " ms");

           // Time the old deserializing comparator on the same bytes.
           before = System.currentTimeMillis();
           for (int loop = 0; loop < 10000; loop++) {
               for (int i = 0; i < TIMES; i++) {
                   test.oldComparator.compare(toCompare1[i], 0, toCompare1[i].length, toCompare2[i], 0, toCompare2[i].length);
               }
           }
           after = System.currentTimeMillis();
           System.out.println("old comparator: " + (after - before) + " ms");

           In this comparison, I see a 6.5x speedup.

           Also note that the Apache code style always uses spaces instead of tabs; make sure you take care of this.

          Gianmarco De Francisci Morales added a comment -

          I modified the main in order to include your code snippet. Thanks very much for the suggestion!

           I reduced the number of tuples in order to avoid OutOfMemory exceptions on my machine (only 1 GiB of RAM). I see the same numbers you reported: a 6.5x speed improvement for the compare() method alone.

          I took care of the tabs issue.

          I will add support for ByteArrays and CharArrays next, together with fallback behaviour for complex datatypes.

          Gianmarco De Francisci Morales added a comment -

          Added support for BYTEARRAY and CHARARRAY data types.
          Added fallback behaviour for unknown data types.
          Added support for non-raw comparison.

           I changed the visibility of the charset field in DataType to public, because I need to know which encoding is being used for strings.

          I am testing the patch locally before submitting it.

          Given that performance tests look good, I think the next step would be to handle tuples different from DefaultTuple. This requires some changes to the infrastructure.

          Daniel Dai added a comment -

           Hi Gianmarco,
           I think we can stick with DefaultTuple, which is the major use case. I doubt there is a way to do a binary compare for an unknown tuple implementation. If the user does not use DefaultTuple then we fall back to the deserializing version.

          Gianmarco De Francisci Morales added a comment -

          Ok, if the user does not use DefaultTuple we fall back to the default deserialization case.

          I added handling of nested tuples via recursion and appropriate unit tests.

          Daniel Dai added a comment -

          Thanks, is the patch ready for review?

          Gianmarco De Francisci Morales added a comment -

          I think it is

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12448251/PIG-1295_0.6.patch
          against trunk revision 958666.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 150 javac compiler warnings (more than the trunk's current 145 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          -1 release audit. The applied patch generated 402 release audit warnings (more than the trunk's current 399 warnings).

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/testReport/
          Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/355/console

          This message is automatically generated.

          Gianmarco De Francisci Morales added a comment -

          Addressed a small bug and added ASF license to source files

          Daniel Dai added a comment -

           Patch looks good. One comment: where is the code for "if the user does not use DefaultTuple we fall back to the default deserialization case"?

          Thejas M Nair added a comment -

           1. If the UTF-8 encoded ordering is the same as Java's String compareTo ordering (I could not find a quick answer), it would be possible to compare the strings by just comparing the bytes, instead of creating additional String objects (see the sketch after this list).
           2. The comparison does not handle chararrays > 65k in length, i.e. when BIGCHARARRAY is used as the type in the encoding. This is a problem even in existing Pig code.
           3. The comparison function is based on the DefaultTuple serialization format; we need to make sure that it gets used only when DefaultTuple is the default tuple. I am making changes to use a new tuple with a different serialization format in PIG-1472. I think we should have this comparison logic defined in the class/interface where the serialization format is defined. I think it should be part of the InterSedes interface in the patch in PIG-1472.
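           As an aside on point 1: UTF-8 byte order follows Unicode code points, while String.compareTo() compares UTF-16 code units, so the two orderings diverge for supplementary characters. A small demonstration:

               import java.nio.charset.StandardCharsets;

               public class Utf8OrderDemo {
                   public static void main(String[] args) {
                       String a = "\uFFFF";                               // U+FFFF, last BMP code point
                       String b = new String(Character.toChars(0x10000)); // U+10000, a surrogate pair
                       // UTF-16 code-unit order: 0xFFFF > 0xD800 (high surrogate), so a > b here...
                       System.out.println(a.compareTo(b) > 0);            // true
                       // ...but in UTF-8 bytes a (EF BF BF) sorts before b (F0 90 80 80).
                       byte[] ua = a.getBytes(StandardCharsets.UTF_8);
                       byte[] ub = b.getBytes(StandardCharsets.UTF_8);
                       System.out.println((ua[0] & 0xFF) < (ub[0] & 0xFF)); // true
                   }
               }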

          Gianmarco De Francisci Morales added a comment -

           I added the code for "if the user does not use DefaultTuple we fall back to the default deserialization case". I assume a user-defined tuple will have a data type byte different from DataType.TUPLE. If this is not the case, I see no way of discerning a DefaultTuple from any other Tuple implementation.
           Anyway, I think this issue needs to be properly addressed in the context of PIG-1472.

           I added support for BIGCHARARRAY.

           UTF-8 decoding is quite convoluted. It is a variable-length encoding, so we cannot avoid using a String.

           Before tackling the integration with PIG-1472 I need to familiarize myself with the code in the patch. I will write a proposal for the integration in the next few days.

           I also made some changes to DataByteArray in order to encapsulate the comparison logic in a publicly accessible method. This way the raw comparison is consistent with the behavior of the class, similar to the other cases where I delegate comparison to the class.

          Daniel Dai added a comment -

           With the change of PIG-1472, we need to change the raw comparator accordingly:
          1. Bag comparison should be changed to compare TINYBAG/SMALLBAG/BAG
          2. Tuple comparison should be changed to compare TINYTUPLE/SMALLTUPLE/TUPLE
          3. Map comparison should be changed to compare TINYMAP/SMALLMAP/MAP
          4. Integer comparison should be changed to compare INTEGER_0/INTEGER_1/INTEGER_INBYTE/INTEGER_INSHORT/INTEGER
          5. ByteArray comparison should be changed to compare TINYBYTEARRAY/SMALLBYTEARRAY/BYTEARRAY
          6. Chararray comparison should be changed to compare SMALLCHARARRAY/CHARARRAY
           7. The raw comparator now depends on the serialization format. We now have two serialization formats, DefaultTuple and BinSedesTuple. It's better to move PigTupleRawComparatorNew inside BinSedesTuple; in this project we focus only on BinSedesTuple, which addresses most use cases.

           In the integration code, we shall check whether the TupleFactory is actually a BinSedesTupleFactory; if it is, use this raw comparator, otherwise use the original comparator.

           I was wrong about the customized tuple: we do not need a fallback scheme for customized tuples. In the serialized format, all Tuples, including customized Tuples, are serialized into the same format.

           Since UTF-8 encoding looks convoluted, we can leave it for now.

          Daniel Dai added a comment -

           More clarification on custom Tuples. There are two cases for custom tuples:
           1. The user creates a custom tuple inside a UDF. In this case, we do not have a special serialized format for the custom tuple; after serialization, we cannot tell whether it was a custom tuple. That is to say, we lose track of the tuple implementation after se/des. Since the serialized format is the same, we can still use the same raw comparator.
           2. If the user uses a custom tuple factory (by overriding "pig.data.tuple.factory.name"), then the serialized format may change. If we detect that the tuple factory is not BinSedesTupleFactory, we shall not use this raw comparator.

          Gianmarco De Francisci Morales added a comment -

          I have studied PIG-1472 a bit.
          My idea is as follows:
          1) write a new method "CompareBinSedesTuple" that uses the new serialization format. The method will look like the one for default tuples that already exists.

           2) In the entry point compare(), determine which TupleFactory we are actually using (via instanceof or by checking the "pig.data.tuple.factory.name" property). If it is the DefaultTupleFactory or the BinSedesTupleFactory, use the appropriate raw comparison method; otherwise resort to deserialization (see the sketch after this comment).

           3) Actually, I think the best place for the comparators is inside each TupleFactory. We can later split the comparator into two classes and put them in the appropriate TupleFactory implementations. We would have to add a getRawComparatorClass() method to the TupleFactory abstract class and modify the code that instantiates the comparator to take the class name from the TupleFactory.
          (I am not sure of this design, I am guessing on many details).
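           A minimal sketch of the factory check from point 2 (BinSedesTupleFactory is the factory from the PIG-1472 patch; the rest is illustrative):

               import org.apache.pig.data.BinSedesTupleFactory;
               import org.apache.pig.data.TupleFactory;

               // Use the raw comparator only when the configured factory produces the
               // serialization format we understand; otherwise fall back to the
               // deserializing comparator.
               final class ComparatorChoiceSketch {
                   static boolean useRawComparator() {
                       // getInstance() honors the "pig.data.tuple.factory.name" property.
                       return TupleFactory.getInstance() instanceof BinSedesTupleFactory;
                   }
               }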

          Thejas M Nair added a comment -

           This sounds fine. In the case of BinSedesTupleFactory/BinSedesTuple, the serialization code is in a subclass (BinInterSedes) of the InterSedes class. Since the comparator function is closely tied to the serialization logic, I think that would be the appropriate place for the comparator implementation code. The BinSedesTupleFactory can return the class obtained from InterSedesFactory.getInterSedesInstance().

          Daniel Dai added a comment -

          If "pig.data.tuple.factory.name" is BinSedesTupleFactory, we should use your raw comparator. DefaultTupleFactory is a minor case, we can use deserialization.

          Gianmarco De Francisci Morales added a comment -

           In DataType the type bytes are assigned in such a way that comparing different data types yields a standard order. This is achieved by carefully choosing the byte values for the types.
           In BinInterSedes this does not happen. So, to reproduce the same order, I need to order the type bytes somehow. The easy way is to reassign the values so they are coherent with DataType (sketched below). The hard way would be to implement a comparison method with all the possible combinations taken into account, but that would be crazy to maintain.
           I also have the same problem for constants: because for INTEGER_0/1 and BOOLEAN_TRUE/FALSE there is no value to read, and the two data type bytes are different, with the current design I need to ensure that BOOLEAN_TRUE > BOOLEAN_FALSE and INTEGER_1 > INTEGER_0.
           Furthermore, it would be good to order the type bytes so that INTEGER > INTEGER_INSHORT > INTEGER_INBYTE, etc.
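           One way to picture the normalization discussed here, as a sketch with made-up byte values standing in for the PIG-1472 type constants (the real constants and values live in the BinInterSedes patch):

               // Illustrative only: local stand-ins for the PIG-1472 type bytes named in
               // this discussion; the real constants live in BinInterSedes.
               final class TypeNormalizationSketch {
                   static final byte INTEGER_0 = 10, INTEGER_1 = 11, INTEGER_INBYTE = 12,
                                     INTEGER_INSHORT = 13, INTEGER = 14;

                   // Map every integer variant to the one canonical DataType byte so that
                   // cross-type comparisons can reuse DataType's existing global ordering.
                   static byte canonicalType(byte binSedesType) {
                       switch (binSedesType) {
                           case INTEGER_0: case INTEGER_1: case INTEGER_INBYTE:
                           case INTEGER_INSHORT: case INTEGER:
                               return org.apache.pig.data.DataType.INTEGER;
                           default:
                               throw new IllegalArgumentException("unhandled type: " + binSedesType);
                       }
                   }
               }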

          Daniel Dai added a comment -

           Hi Gianmarco,
           I think it's better not to assume INTEGER > INTEGER_INSHORT. What the type tells you is how to read the next piece of data correctly. So if you see an INTEGER_INSHORT, read it into an integer, so you can compare it with the other integer types correctly.

          Gianmarco De Francisci Morales added a comment -

           To keep the behaviour backwards compatible I should group all the integers (for example) into the same case statement and implement all the logic there (if it is INTEGER_0/1 do not read anything, if it is INTEGER_INBYTE read a byte, etc.).
           So in each case statement I would need to compare each data type with its siblings, implement the logic to tell which one sorts first, and then, if the other data type is not a sibling, impose the global ordering.
           This will result in quite convoluted code compared to the current patch, because I will not be able to compare data types directly.
           Is this the desired behaviour?

          Daniel Dai added a comment -

           We need to compare a category of data types, not the data type itself. For example:

           // Normalize every integer variant to the INTEGER category, reading the
           // value according to its encoding, then compare values within the category.
           if (bt1==INTEGER_0||bt1==INTEGER_1||bt1==INTEGER_INBYTE||bt1==INTEGER_INSHORT||bt1==INTEGER) {
               type1 = INTEGER;
               value1 = readInteger(bt1);
           }
           if (bt2==INTEGER_0||bt2==INTEGER_1||bt2==INTEGER_INBYTE||bt2==INTEGER_INSHORT||bt2==INTEGER) {
               type2 = INTEGER;
               value2 = readInteger(bt2);
           }
           if (type1==type2)
               return (value1 < value2 ? -1 : (value1 == value2 ? 0 : 1));
           else {
               // Different categories: fall back to the global ordering of the types.
               return type1-type2;
           }
          
          Gianmarco De Francisci Morales added a comment -

          Implemented a new compareBinInterSedesTuple() method.
          This new method works with data serialized with PIG-1472 format.
          It relies on DataType for data type comparison by translating the data types.
           I implemented some logic to read unsigned ints from bytes and shorts, because I am using a ByteBuffer, which does not implement the DataInput interface. We might want to change that later (see the sketch below).
           For now complex data types cause the whole tuple to be deserialized, but I have put in placeholders for methods that deserialize single complex objects and continue the normal execution flow.
           I plan to fill them in when I move the code inside BinInterSedes, so that I can use its methods to read complex data types (readBag, readMap, etc.). This will involve some juggling between DataInputBuffer and ByteBuffer (which is why we might consider switching away from ByteBuffer) to keep the cursors consistent between them.
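           The unsigned reads mentioned above are small enough to sketch: ByteBuffer has no readUnsignedByte()/readUnsignedShort(), so the sign extension is masked off by hand.

               import java.nio.ByteBuffer;

               // ByteBuffer returns signed byte/short; mask to recover the unsigned value.
               final class UnsignedReads {
                   static int readUnsignedByte(ByteBuffer bb)  { return bb.get() & 0xFF; }
                   static int readUnsignedShort(ByteBuffer bb) { return bb.getShort() & 0xFFFF; }
               }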

          I think the next step would be splitting the comparator and putting it into the serialization class (BinInterSedes). I don't know yet how to modify the class interface to make the comparator class available outside (but I think this relates to phase 2).

          I see no good place to put the class that implements the comparator for DefaultTuple, and Daniel said it is a minor case, so should I just throw away the code for DefaultTuple?

          Gianmarco De Francisci Morales added a comment -

          I modified the comparator to be fully recursive for arbitrarily nested data.
          There is a bit of code duplication for the fallback behaviour, I plan to clean that up later.

          I use InterSedes.readDatum() to read complex types.
           I had to modify the readWritable() method to return a WritableComparable. I think it should have been this way from the beginning, looking at DataType.compare and given that it is dealing with a GENERIC_WRITABLECOMPARABLE, as the constant's name says.

           I found there is no implementation of a compareTo() method for InternalMaps, neither in the class nor in DataType, so I commented that out.

          I added some tests for complex data types.

          I think this could be the final revision for phase 1, before I move the code into BinInterSedes.

          Daniel Dai added a comment -

          Patch looks pretty good. Thanks Gianmarco! A couple of comments:
          1. PigTupleRawComparatorNew:324,332,343,357,367,377,387,399,416,474,483,501,512,etc.: if the GeneralizedDataTypes are not equal, we should throw an exception that captures the error
          2. PigTupleRawComparatorNew:455-464: if two items compare as unequal, we should return the result without comparing additional items; that is how we get the performance gain
          3. I am unable to run TestPigTupleRawComparator.main due to an OOM error. What is the speedup after the change?
          4. PigTupleRawComparatorNew:132: we should move the logic of choosing the right comparator into the Pig code, and move the comparator into BinSedesTuple and DefaultTuple. This is part of the integration work; let's mark it as the first thing for phase 2.

          Gianmarco De Francisci Morales added a comment -

          Thanks, here are my observations:

          1. If you look at the old comparator, it uses DataType.compare(). In DataType:454-458, if the two data types are not equal, the value returned is the difference between the data types. I retained the same behavior in the patch.
          2. I think we already do that. There is an additional guard in the for loop, which continues only while rc == 0 (line 452); a fuller sketch follows this list:

          for (int i = 0; i < tsz1 && rc == 0; i++) {

          3. Yes, I somehow changed the number of tuples without noticing. I went back to 10e3 and I see a 10x improvement in time now.
          4. Sure!
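A slightly fuller sketch of the guarded loop from point 2, with the per-field comparison abstracted into a function parameter since the real helper lives elsewhere in the patch:

          import java.nio.ByteBuffer;
          import java.util.function.ToIntBiFunction;

          // Sketch only: rc carries the result of the last field comparison; the guard
          // "rc == 0" stops the loop as soon as two fields differ, so no additional
          // items are compared after the first inequality.
          final class GuardedCompareSketch {
              static int compareFields(ByteBuffer bb1, ByteBuffer bb2, int tsz1,
                                       ToIntBiFunction<ByteBuffer, ByteBuffer> compareDatum) {
                  int rc = 0;
                  for (int i = 0; i < tsz1 && rc == 0; i++) {
                      rc = compareDatum.applyAsInt(bb1, bb2);
                  }
                  return rc;
              }
          }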

          Daniel Dai added a comment -

          Good. I will commit the patch. Let's start with the integration.

          Gianmarco De Francisci Morales added a comment -

          Here is my first stab at integration.
          I split the class into two classes and put them into DefaultTuple and BinInterSedes.
          This asymmetry is not nice, but the serialization code is in different places for different tuples, and I wanted to keep the comparison code as close as possible to the serialization code.

          I modified the TupleFactory interface, adding a method to get the tuple comparator class.
          This method is implemented in BinSedesTupleFactory and overridden in DefaultTupleFactory.

          BinSedesTupleFactory delegates it to a package method in BinInterSedes (where the actual code is).
          DefaultTupleFactory delegates it to DefaultTuple (where the actual code is).

          The actual code for both comparators is just a cut/paste of the methods in PigTupleRawComparatorNew; I just adjusted the entry points a bit.

          I left the new comparator and tests untouched for now.

          Please let me have any comments you may have.

          Gianmarco De Francisci Morales added a comment -

          I have some problems understanding the SecondaryKeyOptimizer.

          For example, given this query:

          
          A = LOAD 'input1' AS (a0,a1,a2);
          B = LOAD 'input2' AS (b0,b1,b2);
          C = cogroup A by (a0,a1), B by (b0,b1) parallel 2;
          D = foreach C { E = limit A 10; F = E.a2; G = DISTINCT F; generate group, COUNT(G);};

          The key type is correctly recognized to be a tuple and so.getNumMRUseSecondaryKey() is 1, but when I get to JobControlCompiler, mro.getUseSecondaryKey() is false.

          Then, when it chooses the comparator in selectComparator(), hasOrderBy, which is (mro.isGlobalSort() || mro.isLimitAfterSort() || mro.usingTypedComparator()), is false.

          So I get into the second switch statement and I get these comparators:

          case DataType.TUPLE:
          job.setSortComparatorClass(PigTupleWritableComparator.class);
          job.setGroupingComparatorClass(PigGroupingTupleWritableComparator.class);

          which, to me, look like they are already comparing tuples in raw format (they use WritableComparator.compareBytes; a sketch of that idiom follows at the end of this comment).

          Is this because the query is one in which order semantics do not matter, so it is already optimized?
          Should it change if I add an ORDER BY somewhere before the LIMIT?
          Is this the relevant case we want to optimize?
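For context, the sketch mentioned above: what such a byte-wise comparator boils down to. This is not the actual Pig class, just the idiom:

          import org.apache.hadoop.io.WritableComparator;

          // A plain lexicographic comparison of the serialized bytes. This is sufficient
          // when only grouping equality matters, but it does not honor type-aware sort
          // semantics, which is why it cannot serve the secondary-sort case.
          final class RawBytesCompareSketch {
              static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                  return WritableComparator.compareBytes(b1, s1, l1, b2, s2, l2);
              }
          }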

          Daniel Dai added a comment -

          Hi, Gianmarco,
          When you do an explain, you will see "Secondary sort: true" if the MR plan will use secondary sort. I am not yet sure why your script is not using secondary sort; I will check it tomorrow. However, the following script will use secondary sort:

          A = LOAD '1.txt' AS (a0, a1, a2);
          B = group A by $0 parallel 2;
          C = foreach B { D = limit A 10; E = order D by $1; generate group, E;};
          explain C;

          When secondary sort is used, Pig will use PigSecondaryKeyComparator, which is not actually raw. We need to replace it with your raw comparator.

          Gianmarco De Francisci Morales added a comment -

          Thanks for the suggestion Daniel.

          My idea to integrate the RawTupleComparator is to modify PigSecondaryKeyComparator to delegate the comparison to it. I cannot mimic the current behavior of making 2 calls (one for the main and one for the secondary key) because I do not know the boundaries between the keys. So I need to make a single call to compare the compound key. There are some issues though.

          1) There are some inconsistencies between the behavior of PigTupleRawComparator (the original one) and PigSecondaryKeyComparator.
          Specifically, when tuples are null the first one returns 0 while the second one compares the indexes.
          Furthermore, indexes are also compared when one of the fields in the tuples is null, in order not to join them (if I understood PIG-927 correctly). This is done in PigSecondaryKeyComparator but not in PigTupleRawComparator.
          Is it designed to be like this or is it a bug?
          I suppose the behaviors of the two comparators should be more or less the same.

          2) In PigSecondaryKeyComparator the key is assumed to be a 2-field tuple where the 0th field is the main key and the 1st field is the secondary key.
          We can directly feed the binary representation of this tuple inside our new raw comparator, but we need to consolidate the sort orders. Right now there are 2 different and independent sort orders serialized in the jobConf (pig.sortOrder and pig.secondarySortOrder). In the simple case, when sort orders for all the columns are specified, we can just concatenate them together (sort of). There are some problems when we have WholeTuple sort orders as they might differ.
          I would like to keep all of this out of the tuple comparator and define some clean interface to pass the sort orders. One problem I see is that I probably need the tuple sizes (recursively) to do this, and this is not known at configuration time. I also need to fix this in the current comparator, in order to take into account the recursion inside nested tuples.

          3) Should I keep all the mIndex/mNull handling outside the RawTupleComparator and write wrappers that deal with them?
          That is, should the RawTupleComparator know how to deal with a NullableTuple, or should it just know its kind of Tuple (BinInterSedes or Default)?

          Daniel Dai added a comment -

          Hi, Gianmarco,
          1. Note that currently PigTupleRawComparator is only used in "order-by" jobs and PigSecondaryKeyComparator is only used by "non-order-by" jobs; there is a semantic difference in null handling. In order-by, different null keys are treated as equal; in group-by, however, nulls from different relations cannot merge (this conforms with the SQL standard). In your case, RawTupleComparator can be used in both the "order-by tuple" case and the "non-order-by with secondary key" case, so there will be a semantic gap between the two, and we need to deal with them separately. For simplicity, we can focus on the secondary key case first (which means different null keys from different relations are not equal); a sketch of that index tie-break follows this list.

          2. We need to deal with sort order. When we use PigTupleRawComparator for secondary sort, it's quite clear the structure will be (main_key, secondary_key, value). I think we can limit PigTupleRawComparator to this structure for now (don't think about the order-by tuple case), so you can pass the sort order and deal with it in PigTupleRawComparator.

          3. Yes, I think it might be better to keep mIndex/mNull handling outside RawTupleComparator.
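A hedged sketch of the index tie-break from point 1, assuming NullableTuple-style keys whose isNull()/getIndex() accessors behave as in PigNullableWritable; the class and the nulls-first ordering are illustrative, not the committed code:

          import org.apache.pig.impl.io.NullableTuple;

          // Sketch of the wrapper-level tie-break: when both keys are null, compare the
          // relation indexes instead of returning 0, so null keys coming from different
          // relations do not merge in group-by.
          final class NullKeySemanticsSketch {
              static int compareNullableKeys(NullableTuple k1, NullableTuple k2) {
                  if (k1.isNull() && k2.isNull()) {
                      return Byte.compare(k1.getIndex(), k2.getIndex());
                  }
                  if (k1.isNull()) return -1; // assumed ordering: nulls first
                  if (k2.isNull()) return 1;
                  return 0; // non-null case: delegate to the raw tuple comparator (omitted)
              }
          }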

          Gianmarco De Francisci Morales added a comment -

          Ok, first working integration.
          Modified PigTupleRawComparatorNew to use the raw comparators via TupleFactory.
          Created a new class PigSecondaryKeyComparatorNew that should substitute the old one. This one uses the raw comparators.
          Modified JobControlCompiler to use the new comparators.

          Moved the null/index semantics outside the raw comparators and inside the wrappers.

          Modified BinSedesTupleComparator to correctly handle sort order. The sort order is applied to the first call to compare tuples. In case we are doing a secondary sort, the sort orders are propagated one level deeper (because we have a nested tuple with the keys, and we need to apply the sort orders to the content of the outermost tuple).
          The code is not the cleanest possible but TestPigTupleRawComparator and TestSecondarySort pass.

          TODO:
          Implement the logic for PIG-927.
          I plan to create a new interface (TupleRawComparator; sketched at the end of this comment) and add a method to check whether a field of type NULL was encountered during the comparison. This interface will be used instead of the simple RawComparator to hold the reference to our raw comparators.

          Write speed test.
          Is there something already made that can be used to test the speed improvement? The inputs for the unit test are of course too small.
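A sketch of the planned TupleRawComparator interface, assuming it extends Hadoop's RawComparator and Configurable; the method name is illustrative:

          import org.apache.hadoop.conf.Configurable;
          import org.apache.hadoop.io.RawComparator;
          import org.apache.pig.data.Tuple;

          // A RawComparator that can also report whether the last comparison ran into a
          // NULL field, so the wrapper can apply the PIG-927 index tie-break afterwards.
          public interface TupleRawComparatorSketch extends RawComparator<Tuple>, Configurable {
              /** True if the previous compare() call encountered a NULL field. */
              boolean hasComparedTupleNull();
          }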

          Gianmarco De Francisci Morales added a comment -

          Fixed issues for PIG-927 as planned.
          Fixed a couple of bugs that came up with testing.
          Aligned the code path for the deserialization version.
          Ran the full test suite; all tests pass.

          TODO:
          Align code for DefaultTuple (low priority).
          Speed tests.

          Gianmarco De Francisci Morales added a comment -

          I performed some speed tests using PigMix2.
          I used query L16 and generated 2 datasets: one with 1M rows (1.6GiB) and the other with 10M rows (16GiB).
          I took end-to-end times using the "time" utility.
          I do not have a cluster, so I ran Pig locally.
          Here are the results:

          Trunk   1M    0m53.469s
          Patched 1M    0m39.076s
          Trunk   10M   9m49.507s
          Patched 10M   8m0.048s

          We have a 20~30% improvement end-to-end for this query.
          I think this is consistent with the expectations.

          Daniel Dai added a comment -

          That is terrific! I will review your patch shortly.

          Daniel Dai added a comment -

          I reviewed and regenerated the patch. A couple of notes:
          1. All unit tests and end-to-end tests pass; Hudson warnings are addressed
          2. I see a consistent performance improvement (around 20%) in PigMix query L16 (using 10 reducers on a cluster of 10 nodes)
          3. Did some refactoring: changed some class names, moved some code around, and moved getRawComparatorClass to Tuple instead of TupleFactory

          Gianmarco, can you take a look and see if my changes are good?

          Thejas M Nair added a comment -

          Comments about PIG-1295_0.14.patch -

          • The comparison logic for BinInterSedes relies on the serialization format, so I think it's better to have it closer to where the serialization format is implemented. I.e., add a function to the InterSedes interface (getComparator()?) and move the implementation logic to the BinInterSedes class.
          • I think TupleFactory is a better place for getRawComparatorClass() for the following reasons:
            • TupleFactory is a singleton class, Tuple is not. Having it in Tuple implies that you can have different values returned by different instances.
            • Adding it to the Tuple interface breaks backward compatibility; all Tuple implementations will need to add this function. Also, it does not make sense for load functions that return a custom tuple to implement this method, because it is not related to that tuple implementation.
          Daniel Dai added a comment -

          Response to Thejas:
          1. Yes, you are right. I will put another layer of abstraction for InterSedes.getRawComparatorClass.
          2. Conceptually, the comparator is in the logic of Tuple. Ideally it should be a static method of Tuple; however, the Tuple interface does not allow me to do that. Even so, I still feel it's better to put it in Tuple. For backward compatibility: first, we will break either Tuple or TupleFactory, and the impact is equivalent; second, in both PigSecondaryKeyComparator and PigTupleSortComparator, we will check whether a Tuple does not implement the new method and fall back to the default serialized version. Thoughts?

          Thejas M Nair added a comment -

          Conceptually, the comparator is in the logic of Tuple.

          This comparator is part of only the default tuple implementation used internally within Pig. So the class that is the source of truth for the default internal tuple implementation seems a good place to have this function. A tuple returned by a load function has nothing to do with the comparator logic.

          Ideally it should be a static method of Tuple; however, the Tuple interface does not allow me to do that.

          Yes, a static method can't be overridden. Since this is supposed to return only one value per Pig query, the singleton TupleFactory is a better place.

          For backward compatibility: first, we will break either Tuple or TupleFactory, and the impact is equivalent;

          No. TupleFactory is an abstract class, while Tuple is an interface. Users will not be forced to change their implementation if we add a function to TupleFactory. Also, users are more likely to have a custom Tuple than a custom TupleFactory, because they might implement different tuples as part of their load function implementation and are unlikely to change the default Tuple implementation used internally in Pig.

          second, in both PigSecondaryKeyComparator and PigTupleSortComparator, we will check whether a Tuple does not implement the new method and fall back to the default serialized version.

          If the Tuple interface is going to have this function, I think we should note in the javadoc that it makes sense to implement the function only if the tuple is going to be used as the default internal tuple implementation, and that a null value can be returned if the user chooses not to implement it.

          Daniel Dai added a comment -

          Attaching another patch to address Thejas's first point.

          Gianmarco De Francisci Morales added a comment -

          My 2 cents on the issue.

          1) I agree with Thejas that comparator logic is strictly tied to serialization, so they should be as close as possible.

          2) I agree that the comparator is something related to Tuple, but Tuple is an interface and this complicates things. Putting the method to access the comparator in TupleFactory seems more natural to me, as the Factory and the Tuple implementation are strongly tied anyway.
          I don't like to have to create a (useless) Tuple in order to get to the comparator class.

          Class<? extends TupleRawComparator> mComparatorClass = TupleFactory.getInstance().newTuple().getRawComparatorClass();

          Other minor things:

          In PigSecondaryKeyComparator:54

                  if (mComparator==null) {
                      try {
                          mComparator = PigTupleDefaultRawComparator.class.newInstance();
                      } catch (InstantiationException e) {
                          throw new RuntimeException(e);
                      } catch (IllegalAccessException e) {
                          throw new RuntimeException(e);
                      }
                  }
                  ((Configurable)mComparator).setConf(jconf);
          

          We can directly instantiate the class instead of using reflection here. Furthermore, there is no need to cast mComparator to Configurable.

                  if (mComparator == null)
                      mComparator = new PigTupleDefaultRawComparator();
                  mComparator.setConf(jconf);
          Daniel Dai added a comment -

          Discussed with Alan; we agree to put getRawComparator into TupleFactory. Attaching PIG-1295_0.16.patch.
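A minimal sketch of where the accessor ends up after this decision; the method name and the null-means-fallback convention are assumptions for illustration, not quotes of the committed code:

          import org.apache.hadoop.io.RawComparator;
          import org.apache.pig.data.Tuple;

          // Sketch: the singleton TupleFactory exposes the raw comparator class;
          // returning null would signal "no raw comparator available", letting callers
          // fall back to the deserializing comparator.
          public abstract class TupleFactorySketch {
              public abstract Tuple newTuple();

              public Class<? extends RawComparator> tupleRawComparatorClass() {
                  return null; // assumed default: fall back to deserialized comparison
              }
          }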

          Daniel Dai added a comment -

          All unit tests pass.

          test-patch:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 8 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] -1 javac. The applied patch generated 156 javac compiler warnings (more than the trunk's current 145 warnings).
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 414 release audit warnings (more than the trunk's current 410 warnings).

          The javac warnings are all about deprecation. For the release audit warnings, I checked that every new file has a license header.

          Patch committed. Thanks Gianmarco!

          Gianmarco De Francisci Morales added a comment -

          Thanks for your support, Daniel.
          It was a great experience.


            People

              Assignee: Gianmarco De Francisci Morales
              Reporter: Daniel Dai
              Votes: 0
              Watchers: 5