Hive
  1. Hive
  2. HIVE-420

Support regular expressions for column in select clause

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.3.0, 0.4.0, 0.6.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      support queries like

       select a.`(ds|hr)?+.+` from a where a.ds='2009-09-09' 

      which are useful when there are humongous number of columns in a table and want to select all columns except the partition columns (ds and hr).

      1. HIVE-420.4.patch
        43 kB
        Zheng Shao
      2. HIVE-420.3.patch
        43 kB
        Zheng Shao
      3. HIVE-420.2.patch
        42 kB
        Zheng Shao
      4. HIVE-420.1.patch
        42 kB
        Zheng Shao

        Issue Links

          Activity

          Prasad Chakka created issue -
          Hide
          Ashish Thusoo added a comment -

          Not a bad idea. Somewhat non SQL but a useful something that the data scientists at FB have been clamoring for...

          Show
          Ashish Thusoo added a comment - Not a bad idea. Somewhat non SQL but a useful something that the data scientists at FB have been clamoring for...
          Hide
          Zheng Shao added a comment - - edited

          With Java regex library we can easily support exclusion of several columns:

           "(ds|hr)?+.+" 

          will match all columns except ds and hr.

           "(dsa|ds)?+.+" 

          will match all columns except ds and dsa. Note that "dsa" has to come before "ds" otherwise it won't work as expected - basically if there exists s and t, and s is a prefix of t, then t has to come before s.

          When the user is specifying column names, they should not need to specify "^" and "$" and we should assume the regex matches the whole string.

          Use http://www.fileformat.info/tool/regex.htm for testing the above examples.

          "(ds|hr)?+.+" 
          Try these:
          ds
          hr
          d
          h
          s
          r
          ads
          dsa
          ahr
          hra
          dsds
          dshr
          hrds
          hrhr
          abc
          
          Show
          Zheng Shao added a comment - - edited With Java regex library we can easily support exclusion of several columns: "(ds|hr)?+.+" will match all columns except ds and hr. "(dsa|ds)?+.+" will match all columns except ds and dsa. Note that "dsa" has to come before "ds" otherwise it won't work as expected - basically if there exists s and t, and s is a prefix of t, then t has to come before s. When the user is specifying column names, they should not need to specify "^" and "$" and we should assume the regex matches the whole string. Use http://www.fileformat.info/tool/regex.htm for testing the above examples. "(ds|hr)?+.+" Try these: ds hr d h s r ads dsa ahr hra dsds dshr hrds hrhr abc
          Hide
          Zheng Shao added a comment -

          Regex for everything except strings starting with "ab":

          "(ab.*)?+.+"
          a
          b
          ab
          abc
          abcd
          cab
          abc
          aab
          baba
          
          Show
          Zheng Shao added a comment - Regex for everything except strings starting with "ab": "(ab.*)?+.+" a b ab abc abcd cab abc aab baba
          Zheng Shao made changes -
          Field Original Value New Value
          Link This issue is blocked by HIVE-466 [ HIVE-466 ]
          Hide
          Zheng Shao added a comment -

          This patch contains HIVE-466.1.patch since that's not committed yet.
          It added the capability of specifying columns using regex.

          Show
          Zheng Shao added a comment - This patch contains HIVE-466 .1.patch since that's not committed yet. It added the capability of specifying columns using regex.
          Zheng Shao made changes -
          Attachment HIVE-420.1.patch [ 12406994 ]
          Zheng Shao made changes -
          Assignee Zheng Shao [ zshao ]
          Hide
          Zheng Shao added a comment -

          Fixed some test errors. We do not allow the use of regex columns in GROUP BY queries.

          Show
          Zheng Shao added a comment - Fixed some test errors. We do not allow the use of regex columns in GROUP BY queries.
          Zheng Shao made changes -
          Attachment HIVE-420.2.patch [ 12406996 ]
          Hide
          Zheng Shao added a comment -

          Fixed another bug: in the last patch, normal columns names are also considered regex, and ambiguous join columns will become 2 columns instead of reporting an ambiguous error. This patch fixes this problem. We do regex matching if and only if the column name contains special characters.

          Show
          Zheng Shao added a comment - Fixed another bug: in the last patch, normal columns names are also considered regex, and ambiguous join columns will become 2 columns instead of reporting an ambiguous error. This patch fixes this problem. We do regex matching if and only if the column name contains special characters.
          Zheng Shao made changes -
          Attachment HIVE-420.3.patch [ 12406997 ]
          Zheng Shao made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.4.0 [ 12313714 ]
          Hide
          Namit Jain added a comment -

          Looks good - can you add a negative test where you are using regex columns in group by queries.
          After that, I will run test and commit if the tests pass

          Show
          Namit Jain added a comment - Looks good - can you add a negative test where you are using regex columns in group by queries. After that, I will run test and commit if the tests pass
          Hide
          Zheng Shao added a comment -

          Added a negative test for Group By regex_col.

          Show
          Zheng Shao added a comment - Added a negative test for Group By regex_col.
          Zheng Shao made changes -
          Attachment HIVE-420.4.patch [ 12407038 ]
          Hide
          Namit Jain added a comment -

          +1

          looks good - will commit if the tests pass

          Show
          Namit Jain added a comment - +1 looks good - will commit if the tests pass
          Show
          Amr Awadallah added a comment - Very cool feature. BTW, for this comment: https://issues.apache.org/jira/browse/HIVE-420?focusedCommentId=12704795&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12704795 the regex is missing a ^
          Hide
          Zheng Shao added a comment -

          @Amr, I think we don't need ^ and $ for regex matching. We only need that for regex searching in case the searched regex has to start from the beginning of the string or the end of the string.

          For matching, the whole string has to be matched with the regex, so there is no need for ^ and $.

          What do you think?

          You can try my examples out at http://www.fileformat.info/tool/regex.htm

          Show
          Zheng Shao added a comment - @Amr, I think we don't need ^ and $ for regex matching. We only need that for regex searching in case the searched regex has to start from the beginning of the string or the end of the string. For matching, the whole string has to be matched with the regex, so there is no need for ^ and $. What do you think? You can try my examples out at http://www.fileformat.info/tool/regex.htm
          Hide
          Namit Jain added a comment -

          Committed. Thanks Zheng

          Show
          Namit Jain added a comment - Committed. Thanks Zheng
          Namit Jain made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Zheng Shao made changes -
          Description support queries like `select a.[^ds] from a where a.ds='2009-09-09'` which are useful when there are humongous number of columns in a table and want to select all columns except the partitions columns. support queries like `select a.`(ds|hr)?+.+` from a where a.ds='2009-09-09'` which are useful when there are humongous number of columns in a table and want to select all columns except the partition columns (ds and hr).
          Zheng Shao made changes -
          Description support queries like `select a.`(ds|hr)?+.+` from a where a.ds='2009-09-09'` which are useful when there are humongous number of columns in a table and want to select all columns except the partition columns (ds and hr). support queries like {code} select a.`(ds|hr)?+.+` from a where a.ds='2009-09-09' {code} which are useful when there are humongous number of columns in a table and want to select all columns except the partition columns (ds and hr).
          Hide
          Amr Awadallah added a comment -

          @Zheng, good point, they are indeed redundant for match.

          Show
          Amr Awadallah added a comment - @Zheng, good point, they are indeed redundant for match.
          Hide
          Zheng Shao added a comment -
          Show
          Zheng Shao added a comment - A link for java regex quantifiers: http://java.sun.com/docs/books/tutorial/essential/regex/quant.html
          Zheng Shao made changes -
          Affects Version/s 0.6.0 [ 12314524 ]
          Affects Version/s 0.2.0 [ 12313565 ]
          Carl Steinbach made changes -
          Affects Version/s 0.3.1 [ 12313845 ]
          Carl Steinbach made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Mariano Dominguez made changes -
          Link This issue relates to HIVE-7631 [ HIVE-7631 ]
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          15d 16h 32m 1 Zheng Shao 01/May/09 10:54
          Patch Available Patch Available Resolved Resolved
          12h 12m 1 Namit Jain 01/May/09 23:06
          Resolved Resolved Closed Closed
          959d 2h 1 Carl Steinbach 17/Dec/11 00:07

            People

            • Assignee:
              Zheng Shao
              Reporter:
              Prasad Chakka
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development