Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.3.0, 0.4.0, 0.4.1, 0.5.0
    • Fix Version/s: 0.6.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      SELECT * FROM tablename t
      WHERE IN(12345,key1,key2,key3);

      ...IN would operate on a given row, and return True when the first argument equaled at least one of the other arguments. So here IN would return true if 12345=key1 OR 12345=key2 OR 12345=key3 (but wouldn't test the latter two if the first matched).

      This would also help with https://issues.apache.org/jira/browse/HIVE-783, if IN were implemented in a manner that allows it to be used in an ON clause.

      1. HIVE-801.1.patch
        17 kB
        Paul Yang
      2. HIVE-801.2.patch
        17 kB
        Paul Yang
      3. HIVE-801.3.patch
        15 kB
        Paul Yang

        Issue Links

          Activity

          Hide
          Namit Jain added a comment -

          Committed. Thanks Paul

          Show
          Namit Jain added a comment - Committed. Thanks Paul
          Hide
          Namit Jain added a comment -

          +1

          will commit if the tests pass

          Show
          Namit Jain added a comment - +1 will commit if the tests pass
          Hide
          Paul Yang added a comment -
          • Fixed build issue with Hadoop 0.17.2.1
          Show
          Paul Yang added a comment - Fixed build issue with Hadoop 0.17.2.1
          Hide
          Paul Yang added a comment -
          • Added comment about type conversion
          Show
          Paul Yang added a comment - Added comment about type conversion
          Hide
          Paul Yang added a comment -

          The issue with using GenericUDFField is that it requires that all argument types to be the same. e.g. field(1, "1") will return 0 because 1 is an integer while "1" is a string. My implementation of IN allows type conversions so that 1 IN ("1") returns true. If we want to have the same kind of type conversion behavior, then it's possible to merge the two.

          Also, that test was to demonstrate the use case in the JIRA where there's an IN in the WHERE clause.

          Show
          Paul Yang added a comment - The issue with using GenericUDFField is that it requires that all argument types to be the same. e.g. field(1, "1") will return 0 because 1 is an integer while "1" is a string. My implementation of IN allows type conversions so that 1 IN ("1") returns true. If we want to have the same kind of type conversion behavior, then it's possible to merge the two. Also, that test was to demonstrate the use case in the JIRA where there's an IN in the WHERE clause.
          Hide
          Namit Jain added a comment -

          Do you want to reuse GenericUDFField.java - or create some common utility and use that.
          Also, why do you have a test

          SELECT key FROM src WHERE key IN ("238", 86);

          in udf_in.q

          Show
          Namit Jain added a comment - Do you want to reuse GenericUDFField.java - or create some common utility and use that. Also, why do you have a test SELECT key FROM src WHERE key IN ("238", 86); in udf_in.q
          Hide
          Paul Yang added a comment -

          This patch supports IN using the SQL syntax:

          SELECT * FROM src WHERE key in(238, 11, ...)
          

          Currently, we only support column names / expressions with in. For a case like WHERE a.foo IN (b.*) , we'll need to support column regexes like SELECT. I'm not sure if that is a big use case though, as most functions take a fixed number of arguments anyway.

          Show
          Paul Yang added a comment - This patch supports IN using the SQL syntax: SELECT * FROM src WHERE key in(238, 11, ...) Currently, we only support column names / expressions with in. For a case like WHERE a.foo IN (b.*) , we'll need to support column regexes like SELECT. I'm not sure if that is a big use case though, as most functions take a fixed number of arguments anyway.
          Hide
          Adam Kramer added a comment -

          Also note that the true utility of this is syntax like
          WHERE a.foo IN (b.*)

          ...for instances where b has many many columns and it is messy to articulate them. I'm thinking about a current table I have with 800 columns...is there a limit on the character-wise length of a query?

          Show
          Adam Kramer added a comment - Also note that the true utility of this is syntax like WHERE a.foo IN (b.*) ...for instances where b has many many columns and it is messy to articulate them. I'm thinking about a current table I have with 800 columns...is there a limit on the character-wise length of a query?
          Hide
          Adam Kramer added a comment -

          Using UNION ALL doesn't do the same as a disjunctive join, though, because in the case of key1 = key2 = key3 = 12345, union all would end up with 3 rows in the output instead of just one row. If we can make assumptions about duplicate rows, we can sort and transform with 'uniq,' but sometimes duplicate rows are meaningful.

          The CASE usage is interesting and intuitive, but throws an error: FAILED: Error in semantic analysis: line 3:38 Both Left and Right Aliases Encountered in Join 1. I don't really understand what this means, but it is the same error I get if I use an IF statement like IF(key1=12345 OR key2=12345,True,False), which was my first attempt to "fool" hive into doing a disjunctive join.

          (also, ELSE NULL END throws a separate error of "TOK_NULL encountered with 0 children")

          Show
          Adam Kramer added a comment - Using UNION ALL doesn't do the same as a disjunctive join, though, because in the case of key1 = key2 = key3 = 12345, union all would end up with 3 rows in the output instead of just one row. If we can make assumptions about duplicate rows, we can sort and transform with 'uniq,' but sometimes duplicate rows are meaningful. The CASE usage is interesting and intuitive, but throws an error: FAILED: Error in semantic analysis: line 3:38 Both Left and Right Aliases Encountered in Join 1. I don't really understand what this means, but it is the same error I get if I use an IF statement like IF(key1=12345 OR key2=12345,True,False), which was my first attempt to "fool" hive into doing a disjunctive join. (also, ELSE NULL END throws a separate error of "TOK_NULL encountered with 0 children")
          Hide
          Raghotham Murthy added a comment -

          Couldnt you do exactly the same thing with IF or CASE constructs in the WHERE clause? A general disjunctive join can be written as a UNION ALL of several join queries. And a general theta join can be done (inefficiently) with a cross product.

          However, you should be able to achieve what you want in HIVE-783 in the ON clause by doing something like the following. Although its a little verbose, you end up with a single map-reduce job and is probably the most efficient way to doit.

          JOIN ON (12345 = (case 12345 when key1 then 12345 when key2 then 12345 when key3 then 12345 else null end))

          Show
          Raghotham Murthy added a comment - Couldnt you do exactly the same thing with IF or CASE constructs in the WHERE clause? A general disjunctive join can be written as a UNION ALL of several join queries. And a general theta join can be done (inefficiently) with a cross product. However, you should be able to achieve what you want in HIVE-783 in the ON clause by doing something like the following. Although its a little verbose, you end up with a single map-reduce job and is probably the most efficient way to doit. JOIN ON (12345 = (case 12345 when key1 then 12345 when key2 then 12345 when key3 then 12345 else null end))

            People

            • Assignee:
              Paul Yang
              Reporter:
              Adam Kramer
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development