Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7524

Distinct on array with any_value

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.17.0
    • None
    • Functions - Drill
    • None

    Description

      AS drill doesn't allow to GROUP BY nor DISTINCT nor ORDER BY complex type, it may appears as a solution to use any_value aggregate function to do some works.

      But some problems appears:

      With a dataset of 223664 rows like:

      SELECT Url, Tags FROM dfs.tmp.`data.json` LIMIT 1;
      +-----------------------------------------+--------+
      |                   Url                   |  Tags  |
      +-----------------------------------------+--------+
      | http://000.dijiushipindian.com/feed.rss | ["us"] |
      +-----------------------------------------+--------+
      

      With the own UDF function to_string that only do

      @Param FieldReader input;
      ...
      String rowString = input.readObject().toString();
      ...
      
      SELECT any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
      GROUP BY NULLIF(UPPER(to_string(T.Tags)),'') /* WORK WELL */;
      +--------+
      |  Tags  |
      +--------+
      | ["us"] |
      | ["cn"] |
      ...
      
      SELECT Url, any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
      GROUP BY Url, NULLIF(UPPER(to_string(T.Tags)),'') /* NOK */;
        java.lang.NegativeArraySizeException
      

      Sometimes the error can be different (details in attachment): java.lang.IndexOutOfBoundsException: index: 1634787136, length: 7629168 (expected: range(0, 8388608))

      And before producing the error, the output show some results like below

      +----------------------------------------------------------------------------------+------+
      |                                       Url                                        | Tags |
      +----------------------------------------------------------------------------------+------+
      | http://everythiing4u.blogspot.com.es/2013/04/omg-proposal-fail.html              | []   |
      | http://everythiing4u.blogspot.com.es/2013/04/omg-this-dude-just-owned-his-friend.html | []   |
      

      And this result is not correct because field Tags is empty although this is never the case in the source file.

      So maybe there is a problem with the aggregate function any_value.

      Attachments

        Activity

          People

            Unassigned Unassigned
            benj641 benj
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: