[SPARK-19037] Run count(distinct x) from sub query found some errors - ASF JIRA

When I use spark-shell or spark-sql to run count(distinct name) over a subquery, the query fails with errors:

select count(distinct name) from (select * from mytest limit 10) as a

If I run the same query in HiveServer2, I get the correct result.

If I drop the DISTINCT and run select count(name) from (select * from mytest limit 10) as a, I also get the correct result.

I also see the same errors when I use distinct() or groupBy() on a subquery.

I suspect there is a bug in jobs that aggregate by key (distinct, group by) when the input is a subquery.

I will add the error output in a new comment.

In addition, I tested dropDuplicates in spark-shell:

1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show

This throws exceptions.

2. spark.table("mytest").dropDuplicates("name").show

This returns the correct result.
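
The SQL and dropDuplicates comparisons above can be set up in spark-shell roughly as follows. This is only a sketch: the contents of mytest here are a hypothetical stand-in (the real table is not shown in this report), the local master and app name are assumptions, and a temp view built from a local Seq may not trigger the failure that the original table did.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("spark-19037-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for `mytest`: 12 rows with repeated names, no nulls.
val names = Seq("a", "b", "a", "c", "b", "d", "e", "a", "f", "g", "h", "c")
val mytest = names.zipWithIndex.map { case (name, id) => (id, name) }.toDF("id", "name")
mytest.createOrReplaceTempView("mytest")

// Query reported as failing: DISTINCT aggregate over a LIMIT subquery.
val distinctSql = "select count(distinct name) from (select * from mytest limit 10) as a"
val distinctCount = spark.sql(distinctSql).first().getLong(0)

// Query reported as working: the same aggregate without DISTINCT.
val plainSql = "select count(name) from (select * from mytest limit 10) as a"
val plainCount = spark.sql(plainSql).first().getLong(0)

// DataFrame variants from the report: dropDuplicates with and without the limit.
val dedupAfterLimit = spark.sql("select * from mytest limit 10").dropDuplicates("name").count()
val dedupWholeTable = spark.table("mytest").dropDuplicates("name").count()

println(s"count(distinct name) = $distinctCount, count(name) = $plainCount")
println(s"dropDuplicates after limit = $dedupAfterLimit, on whole table = $dedupWholeTable")
```

Because LIMIT without ORDER BY does not fix which rows are returned, the distinct count over the limited subquery can vary between runs; only the whole-table figures are determined by the data.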

duplicates

SPARK-18528 limit + groupBy leads to java.lang.NullPointerException