Pig
  1. Pig
  2. PIG-202

ComparatorFunc provided to ORDER clause is not always honoured

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None

      Description

      Specifying a comparator function is acknowledge neither by local implementation, nor by quartile lookup job.

      Patch coming soon.

      1. TestOderBy.patch
        2 kB
        Mathieu Poumeyrol
      2. Sort.v2.patch
        9 kB
        Mathieu Poumeyrol
      3. Sort.patch
        9 kB
        Mathieu Poumeyrol
      4. quantiles.pig
        0.1 kB
        Mathieu Poumeyrol
      5. quantiles.in
        0.3 kB
        Mathieu Poumeyrol
      6. MapreducePlanCompiler.patch
        0.9 kB
        Mathieu Poumeyrol
      7. InstantiateFunc.patch
        6 kB
        Mathieu Poumeyrol
      8. EvalSpec.patch
        1.0 kB
        Mathieu Poumeyrol

        Activity

        Hide
        Mathieu Poumeyrol added a comment -

        This patch makes testOrderBy test both LOCAL and MAPREDUCE mode.

        Local attempt on testTopLevelOrderBy_*_Using are failing.

        Show
        Mathieu Poumeyrol added a comment - This patch makes testOrderBy test both LOCAL and MAPREDUCE mode. Local attempt on testTopLevelOrderBy_*_Using are failing.
        Hide
        Mathieu Poumeyrol added a comment -

        EvalSpec.patch is fix a first issue where comparatorFunction is not created and default implementation is used where it should not.

        But I'm struggling with the fact that in local mode, the comparator is given the whole tuple instead of the relevant datum ($2 for instance) for the tuple.

        I guess I won't be able to do much more by myself on the localmode issue.

        Show
        Mathieu Poumeyrol added a comment - EvalSpec.patch is fix a first issue where comparatorFunction is not created and default implementation is used where it should not. But I'm struggling with the fact that in local mode, the comparator is given the whole tuple instead of the relevant datum ($2 for instance) for the tuple. I guess I won't be able to do much more by myself on the localmode issue.
        Hide
        Mathieu Poumeyrol added a comment -

        MapreduceCompiler patch puts the same compartor in the quantile job than in the sort job. I just feel like it makes more sense, but I might be totally wrong here.

        Show
        Mathieu Poumeyrol added a comment - MapreduceCompiler patch puts the same compartor in the quantile job than in the sort job. I just feel like it makes more sense, but I might be totally wrong here.
        Hide
        Mathieu Poumeyrol added a comment -

        After struggling a bit with the local mode evaluator, I found the read issue. The code I was looking for was actualy in instantiateFunc() method, and my previous patch (1) a useless and incomplete duplication of this code.

        Instead I tried to make sure instantiateFunc() is called on all EvalSpec nodes prior to execution attempts. I now run a POVisitor in LocalExecutionEngine.execute() before pp.open(). It looks much better as the 16 tests from testOrderBy are now working in both local and mapreduce mode.

        The patch is not ready yet (the visitor only visit what my tests needs it to visit) and there is still come cleanup and commenting to do.

        Any comment welcome.

        By the way, the MapReducePlanCompiler patch is a piece of junk, but I'm still convinced there is a severe performance issue here.

        Show
        Mathieu Poumeyrol added a comment - After struggling a bit with the local mode evaluator, I found the read issue. The code I was looking for was actualy in instantiateFunc() method, and my previous patch (1) a useless and incomplete duplication of this code. Instead I tried to make sure instantiateFunc() is called on all EvalSpec nodes prior to execution attempts. I now run a POVisitor in LocalExecutionEngine.execute() before pp.open(). It looks much better as the 16 tests from testOrderBy are now working in both local and mapreduce mode. The patch is not ready yet (the visitor only visit what my tests needs it to visit) and there is still come cleanup and commenting to do. Any comment welcome. By the way, the MapReducePlanCompiler patch is a piece of junk, but I'm still convinced there is a severe performance issue here.
        Hide
        Mathieu Poumeyrol added a comment -

        Patch 4 introduces a POVisitor that call instantiateFunc on all EvalSpec of PO nodes. It is called by the LogicalExecutionEngine just before execution.

        Show
        Mathieu Poumeyrol added a comment - Patch 4 introduces a POVisitor that call instantiateFunc on all EvalSpec of PO nodes. It is called by the LogicalExecutionEngine just before execution.
        Hide
        Mathieu Poumeyrol added a comment -

        Please review InstantiateFunc patch. TestOrderBy may be worth a look too. The rest is junk.

        Show
        Mathieu Poumeyrol added a comment - Please review InstantiateFunc patch. TestOrderBy may be worth a look too. The rest is junk.
        Hide
        Pi Song added a comment -

        Mathieu,

        Welcome in Pig community. I appreciate your attempt to help us. My real work is killing me and seems other people are busy too. Please don't give up. We will have a look at your patches as soon as possible (Latest will be on the weekend).

        Cheers,
        Pi

        Show
        Pi Song added a comment - Mathieu, Welcome in Pig community. I appreciate your attempt to help us. My real work is killing me and seems other people are busy too. Please don't give up. We will have a look at your patches as soon as possible (Latest will be on the weekend). Cheers, Pi
        Hide
        Pi Song added a comment -

        Mathieu,

        Ensuring all the specs instantiating what needed before use using a visitor pattern is a good idea but just to make the code clean we should remove code where those stuffs were previously instantiated as well.

        For the MapreducePlanCompiler thing, you've got very good eyes. Now I also tempted to add that comparator instantiation. Can you prove that at the moment the quantilejob doesn't really sort based on the given function? Possibly a small unit test would be enough.

        In regard to the test, currently we only test either MapReduce or Local at a time. That's why a bug such like this still occurs. In the mean time, I think you should send an email to the mailing list so we can find a global solution for all the tests together.

        PS. Could you please combine all the needed patches? Too many fragments will make it difficult for our already busy committers.

        Show
        Pi Song added a comment - Mathieu, Ensuring all the specs instantiating what needed before use using a visitor pattern is a good idea but just to make the code clean we should remove code where those stuffs were previously instantiated as well. For the MapreducePlanCompiler thing, you've got very good eyes. Now I also tempted to add that comparator instantiation. Can you prove that at the moment the quantilejob doesn't really sort based on the given function? Possibly a small unit test would be enough. In regard to the test, currently we only test either MapReduce or Local at a time. That's why a bug such like this still occurs. In the mean time, I think you should send an email to the mailing list so we can find a global solution for all the tests together. PS. Could you please combine all the needed patches? Too many fragments will make it difficult for our already busy committers.
        Hide
        Mathieu Poumeyrol added a comment -

        As requested, a all-in-one patch (Sort.patch) that:

        • call instantiateFunc on PO before the actual execution (fix using clause in local context)
        • discard the only one "late" comparator instantiation I could found (made redundant, dead code)
        • correct a marginal biais in the findQuantile builtin function (one of the two extremum quantile was bigger or smaller depending on truncation)
        • fix quantile job.

        The quantile job issue is tricky. It is not easy to show how it misbehaves with a pig unit test, as the result is correct... FindQuantiles is responsible for defining a partition of the intermediary keyspace. Hadoop uses this partition through a SortPartitioner instance to split the reduce half of the Sort job among several reduce tasks. Now the FindQuartiles were using a StarSpec as a comparator, whereas SortPartitioner were using the UDF comparator to perform a Arrays.binarySearch. The binary search can not work correctly in these conditions, and this leads to widely unbalanced reduce tasks as most of the keys fall in the same partition.

        "Prooving" this point actualy required counting how many items go to which partition in SortPartitioner (some printf-like debugging). But honestly, I think the patch just makes a lot of sense.

        The fix just provides the UDF compartor to the sort used internaly by the findQuartile job.

        Show
        Mathieu Poumeyrol added a comment - As requested, a all-in-one patch (Sort.patch) that: call instantiateFunc on PO before the actual execution (fix using clause in local context) discard the only one "late" comparator instantiation I could found (made redundant, dead code) correct a marginal biais in the findQuantile builtin function (one of the two extremum quantile was bigger or smaller depending on truncation) fix quantile job. The quantile job issue is tricky. It is not easy to show how it misbehaves with a pig unit test, as the result is correct... FindQuantiles is responsible for defining a partition of the intermediary keyspace. Hadoop uses this partition through a SortPartitioner instance to split the reduce half of the Sort job among several reduce tasks. Now the FindQuartiles were using a StarSpec as a comparator, whereas SortPartitioner were using the UDF comparator to perform a Arrays.binarySearch. The binary search can not work correctly in these conditions, and this leads to widely unbalanced reduce tasks as most of the keys fall in the same partition. "Prooving" this point actualy required counting how many items go to which partition in SortPartitioner (some printf-like debugging). But honestly, I think the patch just makes a lot of sense. The fix just provides the UDF compartor to the sort used internaly by the findQuartile job.
        Hide
        Mathieu Poumeyrol added a comment -

        Well, well, well...

        New delivery of the all-in-one sort patch (Sort.v2.patch). There was still a glitch in my findQuantileJob version, but I think I got it right this time (at least as much as I did before...).

        By the way, I'm starting to worry about the absence of attention or interest from the commiters on this issue. Am I struggling with some piece of code about to be superseded by physical and/or logical layers refactoring ? Or is it the fact that efficient sort operations are critical in what i'm trying to do with pig that make me perceive this as more critical than it is ?

        Show
        Mathieu Poumeyrol added a comment - Well, well, well... New delivery of the all-in-one sort patch (Sort.v2.patch). There was still a glitch in my findQuantileJob version, but I think I got it right this time (at least as much as I did before...). By the way, I'm starting to worry about the absence of attention or interest from the commiters on this issue. Am I struggling with some piece of code about to be superseded by physical and/or logical layers refactoring ? Or is it the fact that efficient sort operations are critical in what i'm trying to do with pig that make me perceive this as more critical than it is ?
        Hide
        Alan Gates added a comment -

        A couple of comments/questions:

        1. I don't understand your changes to FindQuantiles. If I read the code correctly, its now taking the first x values out of the bag, instead of sampling at regular intervals from the whole bag. This has the advantage of not needing to read the whole bag (though the code isn't taking advantage of that), but it will give a much worse sample, at least if the bag is ordered. Am I missing something? Should it be that we take the first n values and then break if the bag is unordered, and every 1/n values if the bag is ordered?
        2. Have you done any performance testing to get an idea of the speed up this gives? Obviously that will depend on the data set, but it would be interesting to see.

        FWIW, I haven't been ignoring your work on this. It seemed you were making good progress and getting feedback from Pi, so I hadn't jumped in yet.

        Show
        Alan Gates added a comment - A couple of comments/questions: I don't understand your changes to FindQuantiles. If I read the code correctly, its now taking the first x values out of the bag, instead of sampling at regular intervals from the whole bag. This has the advantage of not needing to read the whole bag (though the code isn't taking advantage of that), but it will give a much worse sample, at least if the bag is ordered. Am I missing something? Should it be that we take the first n values and then break if the bag is unordered, and every 1/n values if the bag is ordered? Have you done any performance testing to get an idea of the speed up this gives? Obviously that will depend on the data set, but it would be interesting to see. FWIW, I haven't been ignoring your work on this. It seemed you were making good progress and getting feedback from Pi, so I hadn't jumped in yet.
        Hide
        Mathieu Poumeyrol added a comment -

        Thanks for having confirmed I was not wasting my time.

        1. This is not what I'm trying to do. The current implementation, when asked for 9 quantiles among 100 elements (0..99) returns this:
        (all,

        {(0), (12), (24), (36), (48), (60), (72), (84), (96)}

        )
        This does not lead to a good partition. The first part is empty or so, the last part is smaller than the "central" parts.

        Sort.patch and Sort.v2.patch change FindQuantiles to make it return:
        (all,

        {(11), (21), (31), (41), (51), (61), (71), (81), (91)}

        )

        It looks better. Actualy, it looks even nicer with a <= instead of < in the big if in the loop...
        (all,

        {(10), (20), (30), (40), (50), (60), (70), (80), (90)}

        )

        The impact on sort performance of this fix in FindQuantiles is probably marginal. it just avoids some empty or smaller reduce jobs. But it gives better quantiles to a end user trying to use the function.

        2. I will try to run and time my test job over the weekend. The performance killer was not the small glitch in FindQuantiles, but the fact that the SortPartitioner's and the quantiles' comparator were not consistent. I'll try to give you some figures.

        3. I will also generate a Sort.v3.patch (with the <= in FindQuantiles) using svn diff as eclipse tends to generates ugly patches with absolute paths.

        Show
        Mathieu Poumeyrol added a comment - Thanks for having confirmed I was not wasting my time. 1. This is not what I'm trying to do. The current implementation, when asked for 9 quantiles among 100 elements (0..99) returns this: (all, {(0), (12), (24), (36), (48), (60), (72), (84), (96)} ) This does not lead to a good partition. The first part is empty or so, the last part is smaller than the "central" parts. Sort.patch and Sort.v2.patch change FindQuantiles to make it return: (all, {(11), (21), (31), (41), (51), (61), (71), (81), (91)} ) It looks better. Actualy, it looks even nicer with a <= instead of < in the big if in the loop... (all, {(10), (20), (30), (40), (50), (60), (70), (80), (90)} ) The impact on sort performance of this fix in FindQuantiles is probably marginal. it just avoids some empty or smaller reduce jobs. But it gives better quantiles to a end user trying to use the function. 2. I will try to run and time my test job over the weekend. The performance killer was not the small glitch in FindQuantiles, but the fact that the SortPartitioner's and the quantiles' comparator were not consistent. I'll try to give you some figures. 3. I will also generate a Sort.v3.patch (with the <= in FindQuantiles) using svn diff as eclipse tends to generates ugly patches with absolute paths.
        Hide
        Mathieu Poumeyrol added a comment -

        The test files I used to generates the quantiles in my previous comment.

        Show
        Mathieu Poumeyrol added a comment - The test files I used to generates the quantiles in my previous comment.
        Hide
        Pi Song added a comment -

        1) The first quantile always being zero is something really has to be fixed.
        2) The <= to < thing also helps a (very tiny) bit more to make the first bucket the same size as others.

        Theoretically, there should be some small performance gain due to (1)

        I'm looking forward to seeing the result!!!

        Show
        Pi Song added a comment - 1) The first quantile always being zero is something really has to be fixed. 2) The <= to < thing also helps a (very tiny) bit more to make the first bucket the same size as others. Theoretically, there should be some small performance gain due to (1) I'm looking forward to seeing the result!!!
        Hide
        Mathieu Poumeyrol added a comment -

        Just a short update to say I'm alive, but still crunching numbers on my cluster. And trying to make sense of what is hapenning : it looks like I never get balanced partitions, whether or not I specify a UDF comparator, with or without my patches...

        Anyway, still digging.

        Show
        Mathieu Poumeyrol added a comment - Just a short update to say I'm alive, but still crunching numbers on my cluster. And trying to make sense of what is hapenning : it looks like I never get balanced partitions, whether or not I specify a UDF comparator, with or without my patches... Anyway, still digging.
        Hide
        Pi Song added a comment - - edited

        +1 on Sort.v2.patch. This fixes an obvious bug in quantile bucketing. It would be good to see the benchmark but not necessary.

        Show
        Pi Song added a comment - - edited +1 on Sort.v2.patch. This fixes an obvious bug in quantile bucketing. It would be good to see the benchmark but not necessary.
        Hide
        Alan Gates added a comment -

        Fixed checked in at revision 654172. I made one small change, which was to change the name of InsantiateFuncCallerPOVisitor to InstantiateFuncCallerPOVisitor. Thanks Mathieu for your work on this.

        Show
        Alan Gates added a comment - Fixed checked in at revision 654172. I made one small change, which was to change the name of InsantiateFuncCallerPOVisitor to InstantiateFuncCallerPOVisitor. Thanks Mathieu for your work on this.

          People

          • Assignee:
            Mathieu Poumeyrol
            Reporter:
            Mathieu Poumeyrol
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development