Hive
  1. Hive
  2. HIVE-420

Support regular expressions for column in select clause

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.3.0, 0.4.0, 0.6.0
    • Fix Version/s: 0.4.0
    • Component/s: Query Processor
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      support queries like

       select a.`(ds|hr)?+.+` from a where a.ds='2009-09-09' 

      which are useful when there are humongous number of columns in a table and want to select all columns except the partition columns (ds and hr).

      1. HIVE-420.1.patch
        42 kB
        Zheng Shao
      2. HIVE-420.2.patch
        42 kB
        Zheng Shao
      3. HIVE-420.3.patch
        43 kB
        Zheng Shao
      4. HIVE-420.4.patch
        43 kB
        Zheng Shao

        Issue Links

          Activity

          Hide
          Ashish Thusoo added a comment -

          Not a bad idea. Somewhat non SQL but a useful something that the data scientists at FB have been clamoring for...

          Show
          Ashish Thusoo added a comment - Not a bad idea. Somewhat non SQL but a useful something that the data scientists at FB have been clamoring for...
          Hide
          Zheng Shao added a comment - - edited

          With Java regex library we can easily support exclusion of several columns:

           "(ds|hr)?+.+" 

          will match all columns except ds and hr.

           "(dsa|ds)?+.+" 

          will match all columns except ds and dsa. Note that "dsa" has to come before "ds" otherwise it won't work as expected - basically if there exists s and t, and s is a prefix of t, then t has to come before s.

          When the user is specifying column names, they should not need to specify "^" and "$" and we should assume the regex matches the whole string.

          Use http://www.fileformat.info/tool/regex.htm for testing the above examples.

          "(ds|hr)?+.+" 
          Try these:
          ds
          hr
          d
          h
          s
          r
          ads
          dsa
          ahr
          hra
          dsds
          dshr
          hrds
          hrhr
          abc
          
          Show
          Zheng Shao added a comment - - edited With Java regex library we can easily support exclusion of several columns: "(ds|hr)?+.+" will match all columns except ds and hr. "(dsa|ds)?+.+" will match all columns except ds and dsa. Note that "dsa" has to come before "ds" otherwise it won't work as expected - basically if there exists s and t, and s is a prefix of t, then t has to come before s. When the user is specifying column names, they should not need to specify "^" and "$" and we should assume the regex matches the whole string. Use http://www.fileformat.info/tool/regex.htm for testing the above examples. "(ds|hr)?+.+" Try these: ds hr d h s r ads dsa ahr hra dsds dshr hrds hrhr abc
          Hide
          Zheng Shao added a comment -

          Regex for everything except strings starting with "ab":

          "(ab.*)?+.+"
          a
          b
          ab
          abc
          abcd
          cab
          abc
          aab
          baba
          
          Show
          Zheng Shao added a comment - Regex for everything except strings starting with "ab": "(ab.*)?+.+" a b ab abc abcd cab abc aab baba
          Hide
          Zheng Shao added a comment -

          This patch contains HIVE-466.1.patch since that's not committed yet.
          It added the capability of specifying columns using regex.

          Show
          Zheng Shao added a comment - This patch contains HIVE-466 .1.patch since that's not committed yet. It added the capability of specifying columns using regex.
          Hide
          Zheng Shao added a comment -

          Fixed some test errors. We do not allow the use of regex columns in GROUP BY queries.

          Show
          Zheng Shao added a comment - Fixed some test errors. We do not allow the use of regex columns in GROUP BY queries.
          Hide
          Zheng Shao added a comment -

          Fixed another bug: in the last patch, normal columns names are also considered regex, and ambiguous join columns will become 2 columns instead of reporting an ambiguous error. This patch fixes this problem. We do regex matching if and only if the column name contains special characters.

          Show
          Zheng Shao added a comment - Fixed another bug: in the last patch, normal columns names are also considered regex, and ambiguous join columns will become 2 columns instead of reporting an ambiguous error. This patch fixes this problem. We do regex matching if and only if the column name contains special characters.
          Hide
          Namit Jain added a comment -

          Looks good - can you add a negative test where you are using regex columns in group by queries.
          After that, I will run test and commit if the tests pass

          Show
          Namit Jain added a comment - Looks good - can you add a negative test where you are using regex columns in group by queries. After that, I will run test and commit if the tests pass
          Hide
          Zheng Shao added a comment -

          Added a negative test for Group By regex_col.

          Show
          Zheng Shao added a comment - Added a negative test for Group By regex_col.
          Hide
          Namit Jain added a comment -

          +1

          looks good - will commit if the tests pass

          Show
          Namit Jain added a comment - +1 looks good - will commit if the tests pass
          Show
          Amr Awadallah added a comment - Very cool feature. BTW, for this comment: https://issues.apache.org/jira/browse/HIVE-420?focusedCommentId=12704795&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12704795 the regex is missing a ^
          Hide
          Zheng Shao added a comment -

          @Amr, I think we don't need ^ and $ for regex matching. We only need that for regex searching in case the searched regex has to start from the beginning of the string or the end of the string.

          For matching, the whole string has to be matched with the regex, so there is no need for ^ and $.

          What do you think?

          You can try my examples out at http://www.fileformat.info/tool/regex.htm

          Show
          Zheng Shao added a comment - @Amr, I think we don't need ^ and $ for regex matching. We only need that for regex searching in case the searched regex has to start from the beginning of the string or the end of the string. For matching, the whole string has to be matched with the regex, so there is no need for ^ and $. What do you think? You can try my examples out at http://www.fileformat.info/tool/regex.htm
          Hide
          Namit Jain added a comment -

          Committed. Thanks Zheng

          Show
          Namit Jain added a comment - Committed. Thanks Zheng
          Hide
          Amr Awadallah added a comment -

          @Zheng, good point, they are indeed redundant for match.

          Show
          Amr Awadallah added a comment - @Zheng, good point, they are indeed redundant for match.
          Hide
          Zheng Shao added a comment -
          Show
          Zheng Shao added a comment - A link for java regex quantifiers: http://java.sun.com/docs/books/tutorial/essential/regex/quant.html

            People

            • Assignee:
              Zheng Shao
              Reporter:
              Prasad Chakka
            • Votes:
              1 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development