Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: tez-branch
    • Fix Version/s: tez-branch
    • Component/s: tez
    • Labels:
      None

      Description

      Implement optimizations for LIMIT when other parts of Pig-on-Tez are more mature. Some of the optimizations mentioned by Daniel include:

      1. If the previous stage using 1 reduce, no need to add one more vertex
      2. If the limitplan is null (ie, not the "limited order by" case), we might not need a shuffle edge, a pass through edge should be enough if possible
      3. Similar to PIG-1270, we can push limit to InputHandler
      4. We also need to think through the "limited order by" case once "order by" is implemented

        Activity

        Hide
        cheolsoo Cheolsoo Park added a comment -

        Committed to tez. Thank you Alex!

        Show
        cheolsoo Cheolsoo Park added a comment - Committed to tez. Thank you Alex!
        Hide
        abain Alex Bain added a comment -
        Show
        abain Alex Bain added a comment - ReviewBoard at https://reviews.apache.org/r/16926/
        Hide
        abain Alex Bain added a comment -

        This makes sense to me - thanks for the comment.

        Show
        abain Alex Bain added a comment - This makes sense to me - thanks for the comment.
        Hide
        daijy Daniel Dai added a comment -

        Yes, in case of the root vertex (vertex contains load), the parallelism is determined by InputFormat not requestedParallelism, and it cannot be determined in compile time. We will need to do a second limit only vertex in this case. For non-root vertex however, we can use requestedParallelism as a criteria to determine whether or not we need a follow up vertex for limit.

        Show
        daijy Daniel Dai added a comment - Yes, in case of the root vertex (vertex contains load), the parallelism is determined by InputFormat not requestedParallelism, and it cannot be determined in compile time. We will need to do a second limit only vertex in this case. For non-root vertex however, we can use requestedParallelism as a criteria to determine whether or not we need a follow up vertex for limit.
        Hide
        abain Alex Bain added a comment -

        1. You can check requestedParallelism for the tezOperator. This should be doable.

        This doesn't sound quite right to me. Let's say you are doing:
        a = LOAD '/data/myLargeDataSet';
        b = LIMIT a 1000000;
        ...
        where myLargeDataSet contains lots of block-sized files. Then, in that case, the Tez vertex for the POLoad has a requestedParallelism of 1, but the actual runtime parallelism will be equal to the number of files. In this case, the optimization (putting the limit only in the plan for the previous vertex, which in this case, is the vertex for the load) and not having a second vertex fails. Basically, we can't depend on requestedParallelism = 1 to actually be the parallelism at runtime.

        [Just to note, the LimitOptimizer would actually push the limit up to the Input Handler, but just to keep this example simple, let's ignore that for now]

        Show
        abain Alex Bain added a comment - 1. You can check requestedParallelism for the tezOperator. This should be doable. This doesn't sound quite right to me. Let's say you are doing: a = LOAD '/data/myLargeDataSet'; b = LIMIT a 1000000; ... where myLargeDataSet contains lots of block-sized files. Then, in that case, the Tez vertex for the POLoad has a requestedParallelism of 1, but the actual runtime parallelism will be equal to the number of files. In this case, the optimization (putting the limit only in the plan for the previous vertex, which in this case, is the vertex for the load) and not having a second vertex fails. Basically, we can't depend on requestedParallelism = 1 to actually be the parallelism at runtime. [Just to note, the LimitOptimizer would actually push the limit up to the Input Handler, but just to keep this example simple, let's ignore that for now]
        Hide
        daijy Daniel Dai added a comment -

        1. You can check requestedParallelism for the tezOperator. This should be doable.

        2. We can do a non-sorted scatter-gather, but this depends on TEZ-661, we cannot proceed now

        4. We could use combiner and duplicate POLimit in the combiner. Otherwise, plan looks good.

        Show
        daijy Daniel Dai added a comment - 1. You can check requestedParallelism for the tezOperator. This should be doable. 2. We can do a non-sorted scatter-gather, but this depends on TEZ-661 , we cannot proceed now 4. We could use combiner and duplicate POLimit in the combiner. Otherwise, plan looks good.
        Hide
        abain Alex Bain added a comment -

        1. If the previous stage using 1 reduce, no need to add one more vertex

        I think in particular this should be that if the cardinality of the previous Tez vertex (i.e. the number of run-time tasks generated by the previous vertex) equals 1, then there is no more need to add another vertex. Does this still sound like an optimization that we should still implement, and if so, how would we check such a thing?

        2. If the limitplan is null (ie, not the "limited order by" case), we might not need a shuffle edge, a pass through edge should be enough if possible

        I'm reading this to mean that a sort isn't necessary on the edge. Daniel - the way you wrote this, it sounds like you want to set the edge to be a broadcast edge rather than a scatter-gather or one-to-one edge. However, since the parallelism of the the final vertex that implements LIMIT is one, I don't think any of these really make a difference (correct me if I am wrong). Thus, I'm reading this to mean that we don't need to do any sorting on the edge (LIMIT explicitly says that it might return any order).

        For my patch, I am changing the input / output types to be OnFileUnorderedKVOutput / ShuffledUnorderedKVInput and leaving the edge type to SCATTER_GATHER. However, I have these lines commented out with a note that they should be uncommented after TEZ-661. Does this sound like the right thing to do?

        3. Similar to PIG-1270, we can push limit to InputHandler

        This is done by the LimitOptimizer already, I have gone through and verified it.

        4. We also need to think through the "limited order by" case once "order by" is implemented

        There is quite a bit of code to handle LIMIT in TezCompiler::getSortJobs. Is this code already sufficient?

        Show
        abain Alex Bain added a comment - 1. If the previous stage using 1 reduce, no need to add one more vertex I think in particular this should be that if the cardinality of the previous Tez vertex (i.e. the number of run-time tasks generated by the previous vertex) equals 1, then there is no more need to add another vertex. Does this still sound like an optimization that we should still implement, and if so, how would we check such a thing? 2. If the limitplan is null (ie, not the "limited order by" case), we might not need a shuffle edge, a pass through edge should be enough if possible I'm reading this to mean that a sort isn't necessary on the edge. Daniel - the way you wrote this, it sounds like you want to set the edge to be a broadcast edge rather than a scatter-gather or one-to-one edge. However, since the parallelism of the the final vertex that implements LIMIT is one, I don't think any of these really make a difference (correct me if I am wrong). Thus, I'm reading this to mean that we don't need to do any sorting on the edge (LIMIT explicitly says that it might return any order). For my patch, I am changing the input / output types to be OnFileUnorderedKVOutput / ShuffledUnorderedKVInput and leaving the edge type to SCATTER_GATHER. However, I have these lines commented out with a note that they should be uncommented after TEZ-661 . Does this sound like the right thing to do? 3. Similar to PIG-1270 , we can push limit to InputHandler This is done by the LimitOptimizer already, I have gone through and verified it. 4. We also need to think through the "limited order by" case once "order by" is implemented There is quite a bit of code to handle LIMIT in TezCompiler::getSortJobs. Is this code already sufficient?

          People

          • Assignee:
            abain Alex Bain
            Reporter:
            abain Alex Bain
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development