Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.4.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Right now , hive only support equal joins . Sometimes it's not enough, we must consider implementing theta joins like

      SELECT
        a.subid, a.id, t.url
      FROM
        tbl t JOIN aux_tbl a ON t.url rlike a.url_pattern
      WHERE
        t.dt='20090609'
        AND a.dt='20090609';
      

      any condition expression following 'ON' is appropriate.

        Issue Links

          Activity

          Hide
          kozanitis Christos Kozanitis added a comment -

          Hello

          I have created a PR with such a functionality for SparkSQL here: https://github.com/apache/spark/pull/2939

          Show
          kozanitis Christos Kozanitis added a comment - Hello I have created a PR with such a functionality for SparkSQL here: https://github.com/apache/spark/pull/2939
          Hide
          brocknoland Brock Noland added a comment -

          I'd like to break this up into sub tasks but before I do I'd like to solicit feedback...

          Show
          brocknoland Brock Noland added a comment - I'd like to break this up into sub tasks but before I do I'd like to solicit feedback...
          Hide
          brocknoland Brock Noland added a comment -

          Hi,

          I have created a design document for this JIRA on the wiki: https://cwiki.apache.org/confluence/display/Hive/Theta+Join

          Brock

          Show
          brocknoland Brock Noland added a comment - Hi, I have created a design document for this JIRA on the wiki: https://cwiki.apache.org/confluence/display/Hive/Theta+Join Brock
          Hide
          appodictic Edward Capriolo added a comment -

          Min any updates. Come back to us you were doing cool stuff!

          Show
          appodictic Edward Capriolo added a comment - Min any updates. Come back to us you were doing cool stuff!
          Hide
          posix4e Alex Newman added a comment -

          Any updates here?

          Show
          posix4e Alex Newman added a comment - Any updates here?
          Show
          jvs John Sichi added a comment - http://www.ccs.neu.edu/home/mirek/papers/2011-SIGMOD-ParallelJoins.pdf
          Hide
          namit Namit Jain added a comment -

          a.dt='20090609' is the filter

          Show
          namit Namit Jain added a comment - a.dt='20090609' is the filter
          Hide
          coderplay Min Zhou added a comment -

          I didn't see any filter there, hive will put all fields of my small table into HTree.

          hive>explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609';
          
          Common Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {bussiness_id} {subclass_id} {class_id} {note} {name} {url_pattern} {dt}
                          1....
          

          We only put a.url_pattern into a HashMap in our raw map-reduce implemenation.

          Show
          coderplay Min Zhou added a comment - I didn't see any filter there, hive will put all fields of my small table into HTree. hive>explain select /*+ MAPJOIN(a) */ a.url_pattern, w.url from application a join web_log w where w.logdate='20090611' and w.url rlike a.url_pattern and a.dt='20090609'; Common Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {bussiness_id} {subclass_id} {class_id} {note} {name} {url_pattern} {dt} 1.... We only put a.url_pattern into a HashMap in our raw map-reduce implemenation.
          Hide
          namit Namit Jain added a comment -

          I think what Ashish meant was that if one table is small (aux table with 10 records), you can do a cartesian product with that table via map-join which wont be very expensive and then apply your join condition in the form of a filter. Still, no sorting is required - so it should not be all that bad.

          Show
          namit Namit Jain added a comment - I think what Ashish meant was that if one table is small (aux table with 10 records), you can do a cartesian product with that table via map-join which wont be very expensive and then apply your join condition in the form of a filter. Still, no sorting is required - so it should not be all that bad.
          Hide
          coderplay Min Zhou added a comment -

          @Ashish
          I agree with you, map-side joins is okay. however, it doesnot support theta joins right now. we used to load aux tables into the memory of each map node, scan major tables and do our joins.

          Show
          coderplay Min Zhou added a comment - @Ashish I agree with you, map-side joins is okay. however, it doesnot support theta joins right now. we used to load aux tables into the memory of each map node, scan major tables and do our joins.
          Hide
          athusoo Ashish Thusoo added a comment -

          I think with map-side joins this becomes quite easy. How would you do it with a normal map-reduce join without actually doing a cartesian product at some level?

          Show
          athusoo Ashish Thusoo added a comment - I think with map-side joins this becomes quite easy. How would you do it with a normal map-reduce join without actually doing a cartesian product at some level?
          Hide
          coderplay Min Zhou added a comment -

          it's very common for us, and blocked us badly. we ofen have one or more aux tables with about 10k records, which the major table would do theta joins on. I don't think current solution by the means of cartesian product is a good way, it would bring so terrible sorting and i/o overhead to us.

          Show
          coderplay Min Zhou added a comment - it's very common for us, and blocked us badly. we ofen have one or more aux tables with about 10k records, which the major table would do theta joins on. I don't think current solution by the means of cartesian product is a good way, it would bring so terrible sorting and i/o overhead to us.
          Hide
          namit Namit Jain added a comment -

          I don't see a use case for this right now.

          The workaround is do a cartesian product (no ON condition) and push the condition in the WHERE clause - the plan will be sub-optimal currently,
          since all data will go to a single reducer.

          Show
          namit Namit Jain added a comment - I don't see a use case for this right now. The workaround is do a cartesian product (no ON condition) and push the condition in the WHERE clause - the plan will be sub-optimal currently, since all data will go to a single reducer.

            People

            • Assignee:
              brocknoland Brock Noland
              Reporter:
              coderplay Min Zhou
            • Votes:
              9 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

              • Created:
                Updated:

                Development