Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.4.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Right now , hive only support equal joins . Sometimes it's not enough, we must consider implementing theta joins like

      SELECT
        a.subid, a.id, t.url
      FROM
        tbl t JOIN aux_tbl a ON t.url rlike a.url_pattern
      WHERE
        t.dt='20090609'
        AND a.dt='20090609';
      

      any condition expression following 'ON' is appropriate.

        Issue Links

          Activity

          Hide
          Namit Jain added a comment -

          I don't see a use case for this right now.

          The workaround is do a cartesian product (no ON condition) and push the condition in the WHERE clause - the plan will be sub-optimal currently,
          since all data will go to a single reducer.

          Show
          Namit Jain added a comment - I don't see a use case for this right now. The workaround is do a cartesian product (no ON condition) and push the condition in the WHERE clause - the plan will be sub-optimal currently, since all data will go to a single reducer.
          Hide
          Min Zhou added a comment -

          it's very common for us, and blocked us badly. we ofen have one or more aux tables with about 10k records, which the major table would do theta joins on. I don't think current solution by the means of cartesian product is a good way, it would bring so terrible sorting and i/o overhead to us.

          Show
          Min Zhou added a comment - it's very common for us, and blocked us badly. we ofen have one or more aux tables with about 10k records, which the major table would do theta joins on. I don't think current solution by the means of cartesian product is a good way, it would bring so terrible sorting and i/o overhead to us.
          Hide
          Ashish Thusoo added a comment -

          I think with map-side joins this becomes quite easy. How would you do it with a normal map-reduce join without actually doing a cartesian product at some level?

          Show
          Ashish Thusoo added a comment - I think with map-side joins this becomes quite easy. How would you do it with a normal map-reduce join without actually doing a cartesian product at some level?
          Hide
          Min Zhou added a comment -

          @Ashish
          I agree with you, map-side joins is okay. however, it doesnot support theta joins right now. we used to load aux tables into the memory of each map node, scan major tables and do our joins.

          Show
          Min Zhou added a comment - @Ashish I agree with you, map-side joins is okay. however, it doesnot support theta joins right now. we used to load aux tables into the memory of each map node, scan major tables and do our joins.
          Hide
          Namit Jain added a comment -

          I think what Ashish meant was that if one table is small (aux table with 10 records), you can do a cartesian product with that table via map-join which wont be very expensive and then apply your join condition in the form of a filter. Still, no sorting is required - so it should not be all that bad.

          Show
          Namit Jain added a comment - I think what Ashish meant was that if one table is small (aux table with 10 records), you can do a cartesian product with that table via map-join which wont be very expensive and then apply your join condition in the form of a filter. Still, no sorting is required - so it should not be all that bad.
          Hide
          Min Zhou added a comment -

          I didn't see any filter there, hive will put all fields of my small table into HTree.

          hive>explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
          
          Common Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {bussiness_id} {subclass_id} {class_id} {note} {name} {url_pattern} {dt}
                          1....
          

          We only put a.url_pattern into a HashMap in our raw map-reduce implemenation.

          Show
          Min Zhou added a comment - I didn't see any filter there, hive will put all fields of my small table into HTree. hive>explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609'; Common Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {bussiness_id} {subclass_id} {class_id} {note} {name} {url_pattern} {dt} 1.... We only put a.url_pattern into a HashMap in our raw map-reduce implemenation.
          Hide
          Namit Jain added a comment -

          a.dt='20090609' is the filter

          Show
          Namit Jain added a comment - a.dt='20090609' is the filter
          Show
          John Sichi added a comment - http://www.ccs.neu.edu/home/mirek/papers/2011-SIGMOD-ParallelJoins.pdf
          Hide
          Alex Newman added a comment -

          Any updates here?

          Show
          Alex Newman added a comment - Any updates here?
          Hide
          Edward Capriolo added a comment -

          Min any updates. Come back to us you were doing cool stuff!

          Show
          Edward Capriolo added a comment - Min any updates. Come back to us you were doing cool stuff!
          Hide
          Brock Noland added a comment -

          Hi,

          I have created a design document for this JIRA on the wiki: https://cwiki.apache.org/confluence/display/Hive/Theta+Join

          Brock

          Show
          Brock Noland added a comment - Hi, I have created a design document for this JIRA on the wiki: https://cwiki.apache.org/confluence/display/Hive/Theta+Join Brock
          Hide
          Brock Noland added a comment -

          I'd like to break this up into sub tasks but before I do I'd like to solicit feedback...

          Show
          Brock Noland added a comment - I'd like to break this up into sub tasks but before I do I'd like to solicit feedback...

            People

            • Assignee:
              Brock Noland
              Reporter:
              Min Zhou
            • Votes:
              7 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

              • Created:
                Updated:

                Development