Description
Spark SQL runs slowly when executing this code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
count.map(t => t(0)).collect().foreach(println)
But with this query it runs much faster:
SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
Old query stats by phase:
3.2 min
17 s
New query stats by phase:
0.3 s
16 s
20 s
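To reproduce the comparison, both queries can be timed side by side. The following is a minimal sketch meant for spark-shell, assuming the same Spark 1.x API used in the code above and that `sc` is the shell's SparkContext. The `timed` helper is hypothetical and only measures total wall-clock time, not the per-phase times listed above.

```scala
// Sketch for spark-shell; `sc` is assumed to be the shell-provided SparkContext.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.parquetFile("/bojan/test/2014-10-20/").registerTempTable("parquetFile")

// Hypothetical helper: runs a query, then prints the first column of the
// first result row together with the elapsed wall-clock time.
def timed(label: String, query: String): Unit = {
  val start = System.nanoTime()
  val rows = sqlContext.sql(query).collect()
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label: ${rows(0)(0)} in $seconds%.1f s")
}

timed("count distinct", "SELECT COUNT(DISTINCT f2) FROM parquetFile")
timed("subquery form", "SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a")
```

Note that the two forms return the same number only if f2 contains no NULLs: COUNT(DISTINCT f2) skips NULLs, while the subquery form counts the NULL group as one row. Using COUNT(f2) instead of COUNT(*) in the outer query avoids that difference.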
This query may also be worth considering for optimization:
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile