Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7524

Distinct on array with any_value

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.17.0
    • Fix Version/s: None
    • Component/s: Functions - Drill
    • Labels:
      None

      Description

      AS drill doesn't allow to GROUP BY nor DISTINCT nor ORDER BY complex type, it may appears as a solution to use any_value aggregate function to do some works.

      But some problems appears:

      With a dataset of 223664 rows like:

      SELECT Url, Tags FROM dfs.tmp.`data.json` LIMIT 1;
      +-----------------------------------------+--------+
      |                   Url                   |  Tags  |
      +-----------------------------------------+--------+
      | http://000.dijiushipindian.com/feed.rss | ["us"] |
      +-----------------------------------------+--------+
      

      With the own UDF function to_string that only do

      @Param FieldReader input;
      ...
      String rowString = input.readObject().toString();
      ...
      
      SELECT any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
      GROUP BY NULLIF(UPPER(to_string(T.Tags)),'') /* WORK WELL */;
      +--------+
      |  Tags  |
      +--------+
      | ["us"] |
      | ["cn"] |
      ...
      
      SELECT Url, any_value(T.Tags)Tags FROM dfs.tmp.`data.json`
      GROUP BY Url, NULLIF(UPPER(to_string(T.Tags)),'') /* NOK */;
        java.lang.NegativeArraySizeException
      

      Sometimes the error can be different (details in attachment): java.lang.IndexOutOfBoundsException: index: 1634787136, length: 7629168 (expected: range(0, 8388608))

      And before producing the error, the output show some results like below

      +----------------------------------------------------------------------------------+------+
      |                                       Url                                        | Tags |
      +----------------------------------------------------------------------------------+------+
      | http://everythiing4u.blogspot.com.es/2013/04/omg-proposal-fail.html              | []   |
      | http://everythiing4u.blogspot.com.es/2013/04/omg-this-dude-just-owned-his-friend.html | []   |
      

      And this result is not correct because field Tags is empty although this is never the case in the source file.

      So maybe there is a problem with the aggregate function any_value.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              benj641 benj
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated: