Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11.0
    • Component/s: Query Processor
    • Labels:
      None
    • Release Note:
      Map join hint will no longer be valid for some queries. Drop the hint in those cases. Hive will automatically try to convert join to map-join with config hive.auto.convert.join set to true.

      Description

      hive.auto.convert.join has been around for a long time, and is pretty stable.
      When mapjoin hint was created, the above parameter did not exist.

      The only reason for the user to specify a mapjoin currently is if they want
      it to be converted to a bucketed-mapjoin or a sort-merge bucketed mapjoin.
      Eventually, that should also go away, but that may take some time to stabilize.

      There are many rules in SemanticAnalyzer to handle the following trees:

      ReduceSink -> MapJoin
      Union -> MapJoin
      MapJoin -> MapJoin

      This should not be supported anymore. In any of the above scenarios, the
      user can get the mapjoin behavior by setting hive.auto.convert.join to true
      and not specifying the hint. This will simplify the code a lot.

      What does everyone think ?

      1. hive.3784.1.patch
        209 kB
        Namit Jain
      2. hive.3784.10.patch
        504 kB
        Namit Jain
      3. hive.3784.11.patch
        547 kB
        Namit Jain
      4. hive.3784.12.patch
        550 kB
        Namit Jain
      5. hive.3784.13.patch
        617 kB
        Namit Jain
      6. hive.3784.14.patch
        617 kB
        Namit Jain
      7. hive.3784.15.patch
        657 kB
        Namit Jain
      8. hive.3784.16.patch
        658 kB
        Namit Jain
      9. hive.3784.17.patch
        659 kB
        Namit Jain
      10. hive.3784.18.patch
        659 kB
        Namit Jain
      11. hive.3784.19.patch
        659 kB
        Namit Jain
      12. hive.3784.2.patch
        209 kB
        Namit Jain
      13. hive.3784.21.patch
        659 kB
        Namit Jain
      14. hive.3784.22.patch
        665 kB
        Namit Jain
      15. hive.3784.3.patch
        434 kB
        Namit Jain
      16. hive.3784.4.patch
        650 kB
        Namit Jain
      17. hive.3784.5.patch
        650 kB
        Namit Jain
      18. hive.3784.6.patch
        528 kB
        Namit Jain
      19. hive.3784.7.patch
        571 kB
        Namit Jain
      20. hive.3784.8.patch
        578 kB
        Namit Jain
      21. hive.3784.9.patch
        581 kB
        Namit Jain

        Issue Links

          Activity

          Hide
          Ashutosh Chauhan added a comment -

          I agree. In general,
          more config = more confusion for user to use it.
          more config = more confusion for developer to maintain code.

          Show
          Ashutosh Chauhan added a comment - I agree. In general, more config = more confusion for user to use it. more config = more confusion for developer to maintain code.
          Hide
          Namit Jain added a comment -

          The high level idea is something like this: https://reviews.facebook.net/D7257

          It should end up in removing a lot of redundant code.

          Show
          Namit Jain added a comment - The high level idea is something like this: https://reviews.facebook.net/D7257 It should end up in removing a lot of redundant code.
          Hide
          Namit Jain added a comment -

          This removes a lot of redundant code, and also optimizes a lot of queries - mapjoin followed by groupby.
          Due to that, there are a lof of plan changes.

          Show
          Namit Jain added a comment - This removes a lot of redundant code, and also optimizes a lot of queries - mapjoin followed by groupby. Due to that, there are a lof of plan changes.
          Hide
          Namit Jain added a comment -

          Verified that all the plan changes actually remove a redundant MR stage. There is no change in query results.

          Show
          Namit Jain added a comment - Verified that all the plan changes actually remove a redundant MR stage. There is no change in query results.
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Hi, couple of questsions:

          • Does this rule out bucketed map-join or hive.optimize.bucketmapjoin will continue to work? If it is the earlier, shouldn't fixing that be a blocker of this?
          • Also, does this rule out map join of multiple small tables in a single map-only job? As discussed on HIVE-3652, giving map-join hints to a nested join automatically converts it into a single map-join map.

          also optimizes a lot of queries - mapjoin followed by groupby.

          This is great!

          Show
          Vinod Kumar Vavilapalli added a comment - Hi, couple of questsions: Does this rule out bucketed map-join or hive.optimize.bucketmapjoin will continue to work? If it is the earlier, shouldn't fixing that be a blocker of this? Also, does this rule out map join of multiple small tables in a single map-only job? As discussed on HIVE-3652 , giving map-join hints to a nested join automatically converts it into a single map-join map. also optimizes a lot of queries - mapjoin followed by groupby. This is great!
          Hide
          Namit Jain added a comment -

          >>> Does this rule out bucketed map-join or hive.optimize.bucketmapjoin will continue to work? If it is the earlier, shouldn't fixing that be a blocker of this?

          That will continue to work.

          >>> Also, does this rule out map join of multiple small tables in a single map-only job? As discussed on HIVE-3652, giving map-join hints to a nested join automatically converts it into a single map-join map.

          No. If the join key is the same, it will be a single MR job as today.
          With different join keys, it needs some work to merge into a single MR anyway - that work is independent of this change.

          As I said before, with this jira, everything that explicitly needs a map-join hint will continue to work as is.
          Whatever is being de-supported did not need a explicit mapjoin hint in the first place.

          Show
          Namit Jain added a comment - >>> Does this rule out bucketed map-join or hive.optimize.bucketmapjoin will continue to work? If it is the earlier, shouldn't fixing that be a blocker of this? That will continue to work. >>> Also, does this rule out map join of multiple small tables in a single map-only job? As discussed on HIVE-3652 , giving map-join hints to a nested join automatically converts it into a single map-join map. No. If the join key is the same, it will be a single MR job as today. With different join keys, it needs some work to merge into a single MR anyway - that work is independent of this change. As I said before, with this jira, everything that explicitly needs a map-join hint will continue to work as is. Whatever is being de-supported did not need a explicit mapjoin hint in the first place.
          Hide
          Namit Jain added a comment -

          updated some log files - all tests passed

          Show
          Namit Jain added a comment - updated some log files - all tests passed
          Hide
          Amareshwari Sriramadasu added a comment -

          Just confirming MapJoin->ReduceSink (MapJoin followed by Groupby) be single Map-Reduce Job and can we close HIVE-1695 a duplicate of this?

          Show
          Amareshwari Sriramadasu added a comment - Just confirming MapJoin->ReduceSink (MapJoin followed by Groupby) be single Map-Reduce Job and can we close HIVE-1695 a duplicate of this?
          Hide
          Namit Jain added a comment -
          Show
          Namit Jain added a comment - Amareshwari Sriramadasu , yes
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Thanks for the clarification, Namit!

          Show
          Vinod Kumar Vavilapalli added a comment - Thanks for the clarification, Namit!
          Hide
          Mark Grover added a comment -

          Also, this would a good opportunity to consider making hive.auto.convert.join true by defualt? I am using Hive 0.9, and it's false there by default. How do you guys feel about changing it to true by default?

          Show
          Mark Grover added a comment - Also, this would a good opportunity to consider making hive.auto.convert.join true by defualt? I am using Hive 0.9, and it's false there by default. How do you guys feel about changing it to true by default?
          Hide
          Mark Grover added a comment -
          Show
          Mark Grover added a comment - Never mind. Found this: https://issues.apache.org/jira/browse/HIVE-3297
          Hide
          Namit Jain added a comment -

          Mark, couldn't agree more.
          That would be a very big patch (due to the number of output files changed), so would like to de-couple HIVE-3784 and HIVE-3297

          Show
          Namit Jain added a comment - Mark, couldn't agree more. That would be a very big patch (due to the number of output files changed), so would like to de-couple HIVE-3784 and HIVE-3297
          Hide
          Mark Grover added a comment -

          Agreed!

          Show
          Mark Grover added a comment - Agreed!
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Was trying to play with the patch, and my earlier concern resurfaced.

          With different join keys, it needs some work to merge into a single MR anyway - that work is independent of this change.

          That isn't true. Even today, I am able to get hive to automatically merge multi-way map-join with different join keys into a single map-only job. With this patch, we are losing that functionality. For e.g., the following runs as a single Map only job:

          select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
                  ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
                        bigTable                                       
                        JOIN                                                                
                        smallTableOne on (bigTable.idOne = smallTableOne.idOne)                                                   
                    ) firstjoin                                                             
                      JOIN                                                                  
                      smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo)                       
          
          Show
          Vinod Kumar Vavilapalli added a comment - Was trying to play with the patch, and my earlier concern resurfaced. With different join keys, it needs some work to merge into a single MR anyway - that work is independent of this change. That isn't true. Even today, I am able to get hive to automatically merge multi-way map-join with different join keys into a single map-only job. With this patch, we are losing that functionality. For e.g., the following runs as a single Map only job: select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne) ) firstjoin JOIN smallTableTwo on (firstjoin.idTwo = smallTableTwo.idTwo)
          Hide
          Namit Jain added a comment -

          Vinod Kumar Vavilapalli, I agree that with change, it will be 2 map-only jobs instead of 1 map job.
          I haven't tested it myself, but seems likely.

          But, the current code has way too much complexity to handle this special case. Ideally, this change
          should be done as part of converting join tasks into conditional join tasks. That layer should be smarter
          to see that there is no need of a conditional task, and a map-only task is possible. Also, another layer needs
          to be written to merge consecutive map-only tasks.

          Although, for this special case, we are taking a hit, I still believe this is the right long term way to go.

          Show
          Namit Jain added a comment - Vinod Kumar Vavilapalli , I agree that with change, it will be 2 map-only jobs instead of 1 map job. I haven't tested it myself, but seems likely. But, the current code has way too much complexity to handle this special case. Ideally, this change should be done as part of converting join tasks into conditional join tasks. That layer should be smarter to see that there is no need of a conditional task, and a map-only task is possible. Also, another layer needs to be written to merge consecutive map-only tasks. Although, for this special case, we are taking a hit, I still believe this is the right long term way to go.
          Hide
          Ashutosh Chauhan added a comment -

          Namit,
          I very much like this work, since diffs are all red, which is good since it is removing lot of unnecessary complexity from the codebase, but Vinod's point is worth considering. I also see this you have moved mapjoin_subquery.q, mapjoin_mapjoin.q and such test cases from positive to negative. Do you have a proposal on what could be done to preserve that optimization of pipelining multiple map-join operators (even on different keys) in a single mapper?

          Show
          Ashutosh Chauhan added a comment - Namit, I very much like this work, since diffs are all red, which is good since it is removing lot of unnecessary complexity from the codebase, but Vinod's point is worth considering. I also see this you have moved mapjoin_subquery.q, mapjoin_mapjoin.q and such test cases from positive to negative. Do you have a proposal on what could be done to preserve that optimization of pipelining multiple map-join operators (even on different keys) in a single mapper?
          Hide
          Namit Jain added a comment -

          Let me think about it - this is like a star-schema join.

          Show
          Namit Jain added a comment - Let me think about it - this is like a star-schema join.
          Hide
          Ashutosh Chauhan added a comment -

          This patch can optimize

          select /*+ MAPJOIN(t2) */ * from t1 join t2 on t1.t11 = t2.t21   group by t1.t12;
          

          to generate 1 MR job, which is cool. Now only if it can handle map-joins on different keys correctly, this will be a real winner for Hive.

          Show
          Ashutosh Chauhan added a comment - This patch can optimize select /*+ MAPJOIN(t2) */ * from t1 join t2 on t1.t11 = t2.t21 group by t1.t12; to generate 1 MR job, which is cool. Now only if it can handle map-joins on different keys correctly, this will be a real winner for Hive.
          Hide
          Namit Jain added a comment -

          I was thinking of adding a size parameter. If n-1 tables are below that size (for a n-way join), the joinTask should be converted to a mapJoin task
          (map-only) instead of a conditional task. We would need a further optimization step to merge 2 map-only tasks to a single map-only task.

          Navis, what do you think ? Can you think of a better idea ?

          Show
          Namit Jain added a comment - I was thinking of adding a size parameter. If n-1 tables are below that size (for a n-way join), the joinTask should be converted to a mapJoin task (map-only) instead of a conditional task. We would need a further optimization step to merge 2 map-only tasks to a single map-only task. Navis , what do you think ? Can you think of a better idea ?
          Hide
          Namit Jain added a comment -

          I got the above query working. The latest patch has the changes.
          Will start cleaning up, and fixing the patch.

          Show
          Namit Jain added a comment - I got the above query working. The latest patch has the changes. Will start cleaning up, and fixing the patch.
          Hide
          Namit Jain added a comment -

          Recently, in https://issues.apache.org/jira/browse/HIVE-3633, support was added for sub-query sort-merge joins, where joins
          could be performed across sub-queries, and each sub-query was transformed into a sort-merge join. This support is being removed,
          will be added automatically as part of https://issues.apache.org/jira/browse/HIVE-3403

          Show
          Namit Jain added a comment - Recently, in https://issues.apache.org/jira/browse/HIVE-3633 , support was added for sub-query sort-merge joins, where joins could be performed across sub-queries, and each sub-query was transformed into a sort-merge join. This support is being removed, will be added automatically as part of https://issues.apache.org/jira/browse/HIVE-3403
          Hide
          Namit Jain added a comment -

          Running tests after refreshing.

          Vinod Kumar Vavilapalli, does it look OK for your usecase ?

          Show
          Namit Jain added a comment - Running tests after refreshing. Vinod Kumar Vavilapalli , does it look OK for your usecase ?
          Hide
          Namit Jain added a comment -

          The tests passed

          Show
          Namit Jain added a comment - The tests passed
          Hide
          Ashutosh Chauhan added a comment -

          I am still reviewing the code. I have added some initial comments on phabricator. But, at this point my major concern is following:
          Reading from code it feels like that its not possible to have a) union before mapjoin b) union after mapjoin c) common join after mapjoin. In all three cases there will be performance impact for end-users. Ideally, I would like to see all these cases handled before this get in, though I am particularly concerned about c) common join after mapjoin, since that looks relatively more common use-case than a) or b). Can you take a look and see how hard it will be do it?

          Show
          Ashutosh Chauhan added a comment - I am still reviewing the code. I have added some initial comments on phabricator. But, at this point my major concern is following: Reading from code it feels like that its not possible to have a) union before mapjoin b) union after mapjoin c) common join after mapjoin. In all three cases there will be performance impact for end-users. Ideally, I would like to see all these cases handled before this get in, though I am particularly concerned about c) common join after mapjoin, since that looks relatively more common use-case than a) or b). Can you take a look and see how hard it will be do it?
          Hide
          Namit Jain added a comment -

          Reading from code it feels like that its not possible to have a) union before mapjoin b) union after mapjoin c) common join after mapjoin.

          Union before/after mapjoin hint will not be supported – That is not true completely. Without a map-join hint, joins before/after union
          will automatically get converted to a map-join. There is nothing which was supported before and will not be in future - it will work without
          the hint.

          common join after mapjoin

          Same thing. The only small case if for sort-merge joins, which I added recently in HIVE-3633. This will also be made automatic soon.

          I will address your other comments.

          Show
          Namit Jain added a comment - Reading from code it feels like that its not possible to have a) union before mapjoin b) union after mapjoin c) common join after mapjoin. Union before/after mapjoin hint will not be supported – That is not true completely. Without a map-join hint, joins before/after union will automatically get converted to a map-join. There is nothing which was supported before and will not be in future - it will work without the hint. common join after mapjoin Same thing. The only small case if for sort-merge joins, which I added recently in HIVE-3633 . This will also be made automatic soon. I will address your other comments.
          Hide
          Namit Jain added a comment -

          Patch submitted with comments addressed.

          Show
          Namit Jain added a comment - Patch submitted with comments addressed.
          Hide
          Ashutosh Chauhan added a comment -

          Had an offline review with Namit. Two major points came out:

          • Consider a case of join followed by group-by. In such a case if Join gets converted into map-join than subsequent group-by can be pushed into first MR job, instead of executing it in second MR. Will be done in follow-up jira: HIVE-3952
          • Instead of specifying size corresponding to 1 table, it will be better to specify combined size for all tables as a threshold.

          Namit, I had other code level comments which I added on Phabricator.

          Show
          Ashutosh Chauhan added a comment - Had an offline review with Namit. Two major points came out: Consider a case of join followed by group-by. In such a case if Join gets converted into map-join than subsequent group-by can be pushed into first MR job, instead of executing it in second MR. Will be done in follow-up jira: HIVE-3952 Instead of specifying size corresponding to 1 table, it will be better to specify combined size for all tables as a threshold. Namit, I had other code level comments which I added on Phabricator.
          Hide
          Namit Jain added a comment -

          comments addressed. running tests

          Show
          Namit Jain added a comment - comments addressed. running tests
          Hide
          Ashutosh Chauhan added a comment -

          Thanks Namit for addressing comments. There were few which were unaddressed. I added them again on Phabricator.

          Show
          Ashutosh Chauhan added a comment - Thanks Namit for addressing comments. There were few which were unaddressed. I added them again on Phabricator.
          Hide
          Namit Jain added a comment -

          The tests finished fine

          Show
          Namit Jain added a comment - The tests finished fine
          Hide
          Namit Jain added a comment -

          Comments addressed in the latest patch.

          Show
          Namit Jain added a comment - Comments addressed in the latest patch.
          Hide
          Ashutosh Chauhan added a comment -

          Thanks for adding comments. +1 Running tests now.

          Show
          Ashutosh Chauhan added a comment - Thanks for adding comments. +1 Running tests now.
          Hide
          Ashutosh Chauhan added a comment -

          Committed to trunk. Thanks, Namit for tons of clean-up in this work!

          Show
          Ashutosh Chauhan added a comment - Committed to trunk. Thanks, Namit for tons of clean-up in this work!
          Hide
          Namit Jain added a comment -

          Thanks a lot Ashutosh

          Show
          Namit Jain added a comment - Thanks a lot Ashutosh
          Hide
          Vinod Kumar Vavilapalli added a comment -

          Namit Jain Sorry I was away and couldn't reply back.

          Thanks for addressing my use-case, I'll play with it!

          Show
          Vinod Kumar Vavilapalli added a comment - Namit Jain Sorry I was away and couldn't reply back. Thanks for addressing my use-case, I'll play with it!

            People

            • Assignee:
              Namit Jain
              Reporter:
              Namit Jain
            • Votes:
              1 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development