Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Query Processor
    • Labels: None

      Description

      Moving the discussion to a new jira:

      I've implemented group_concat() in a rush, and found a couple of things difficult to solve:
      1. The function group_concat() has an internal ORDER BY clause; currently, we can't implement such an aggregation in Hive.
      2. When the strings to be concatenated are too large (in other words, when data skew appears), there is often not enough memory to store such a big result.
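      For reference, a minimal sketch of the MySQL behaviour this issue asks Hive to emulate (the table and column names here are purely illustrative):

      {code}
      -- MySQL: GROUP_CONCAT supports an internal ORDER BY and a SEPARATOR,
      -- and its result is truncated to group_concat_max_len (see the comments below).
      SELECT dept,
             GROUP_CONCAT(emp_name ORDER BY emp_name SEPARATOR ', ') AS members
      FROM   employees
      GROUP  BY dept;
      {code}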

        Issue Links

          Activity

          Jian Wang added a comment -

          Svetozar Misljencevic
          I used this concat_ws(' ', map_keys(UNION_MAP(MAP(your_column, 'dummy')))) method instead of group_concat, but I got an error like this:

          {code}
          FAILED: SemanticException [Error 10011]: Line 172:30 Invalid function 'UNION_MAP'
          {code}

          Should I add some jars?
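          (If UNION_MAP comes from an external UDAF jar rather than being built into the Hive distribution in use, registering it would look roughly like the sketch below; the jar path and class name are placeholders, not real artifacts.)

          {code}
          -- Hypothetical registration of an external UDAF providing union_map;
          -- adjust the jar path and class name to whatever actually ships it.
          ADD JAR /path/to/udaf-union-map.jar;
          CREATE TEMPORARY FUNCTION union_map AS 'com.example.hive.udaf.UnionMap';
          {code}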

          Svetozar Misljencevic added a comment -

          You could use the currently undocumented union_map function as a workaround... Try:
          concat_ws(' ', map_keys(UNION_MAP(MAP(your_column, 'dummy'))))
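          A minimal sketch of that workaround inside a full query (table and column names are illustrative, and union_map availability depends on the Hive build in use):

          {code}
          -- Collect each group's values as keys of a map, then concatenate the keys.
          -- Note: map keys are de-duplicated and their order is not guaranteed.
          SELECT dept,
                 concat_ws(' ', map_keys(union_map(MAP(emp_name, 'dummy')))) AS members
          FROM   employees
          GROUP  BY dept;
          {code}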

          guyan added a comment -

          Hi all, will this issue be resolved?

          Jeff Hammerbacher added a comment -

          Hey,

          Given that this JIRA has been opened three separate times, and that I have received a recent request for it in IRC, I think it would be worth bumping to near the top of the queue.

          Thanks,
          Jeff

          Jim Blomo added a comment -

          Also discussed in http://www.mail-archive.com/hive-user@hadoop.apache.org/msg02854.html
          Zheng Shao added a comment -

          Shall we do the follow-up jira to provide the capability before we do this?
          Without the follow-up jira, the group_concat is non-deterministic, so it's harder to write a test.

          Namit Jain added a comment -

          I agree - but I think it can be done in a follow-up jira, since the two are kind of independent.

          Ashish Thusoo added a comment -

          Found the JIRA...

          Since group by is done in the reducer, you could just use the trick that is used in

          distribute by x sort by y

          when we do MAP and REDUCE operators. By setting up reduce sink in a similar way you would be able to ensure that each reducer gets the rows for a value of x in the sorted order of y. You can look at how we generate plans for the transform operator and use the same strategy in group by code.

          That should work, and of course in this case we would have to turn off any map-side aggregation?
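          For illustration, the clause pattern being referred to (table and column names are illustrative):

          {code}
          -- DISTRIBUTE BY sends all rows with the same x to one reducer;
          -- SORT BY then orders the rows within each reducer by x, y.
          SELECT x, y
          FROM   t
          DISTRIBUTE BY x
          SORT BY x, y;
          {code}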

          Namit Jain added a comment -

          I agree that 1. is a problem - we don't have any good way to handle the order by.

          A workaround would be to first sort the results in a sub-query and then have the group_concat outside the sub-query (a rough sketch follows below); but, syntactically, it is more painful.
          Is there a requirement for the ordering?

          The separator can be handled as a configuration parameter, like the maximum length.

          Another option is to treat group_concat() specially, all the way from the parser - I don't like it because it is kind of hacky.
          I think we should first find out whether anyone needs ORDER BY in group_concat and then think about it.
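          A rough sketch of that sub-query workaround, assuming a hypothetical group_concat() UDAF were available (all names are illustrative):

          {code}
          -- The inner query imposes the ordering; the (hypothetical) group_concat()
          -- in the outer query then sees each group's rows already sorted.
          SELECT x, group_concat(y) AS ys
          FROM (
            SELECT x, y
            FROM   t
            DISTRIBUTE BY x
            SORT BY x, y
          ) ordered_rows
          GROUP BY x;
          {code}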

          Namit Jain added a comment -

          I am assuming you mean group_concat() in MySQL.

          2. should not be a problem:

          From the MySQL documentation:

          The result is truncated to the maximum length that is given by the group_concat_max_len system variable, which has a default value of 1024. The value can be set higher, although the effective maximum length of the return value is constrained by the value of max_allowed_packet. The syntax to change the value of group_concat_max_len at runtime is as follows, where val is an unsigned integer:

          SET [GLOBAL | SESSION] group_concat_max_len = val

          We can also add a new configuration parameter for the maximum length.


            People

            • Assignee: Min Zhou
            • Reporter: Namit Jain
            • Votes: 5
            • Watchers: 8
