Spark / SPARK-4366 Aggregation Improvement / SPARK-4243

Spark SQL SELECT COUNT DISTINCT optimization

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.6.0
    • Component/s: SQL
    • Labels: None

    Description

      Spark SQL runs slowly when using this code:

      // Run in spark-shell, where `sc` is the provided SparkContext.
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
      parquetFile.registerTempTable("parquetFile")
      // Slow form: COUNT(DISTINCT ...) applied directly to the table.
      val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
      count.map(t => t(0)).collect().foreach(println)
      

      But with this query it runs much faster:

      SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
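
One caveat about the rewrite above: under standard SQL semantics the two queries agree only when f2 contains no NULLs. COUNT(DISTINCT f2) skips NULLs, while SELECT DISTINCT f2 keeps NULL as one row that COUNT(*) then counts. The following is a minimal sketch in plain Scala collections (no Spark involved), with None standing in for NULL. The object and method names are hypothetical and exist only for illustration.

```scala
object CountDistinctSemantics {
  // Models COUNT(DISTINCT col): NULLs (None) are ignored before deduplication.
  def countDistinct(col: Seq[Option[Int]]): Long =
    col.flatten.distinct.size.toLong

  // Models COUNT(*) FROM (SELECT DISTINCT col ...): NULL survives as its own row.
  def countStarOverDistinct(col: Seq[Option[Int]]): Long =
    col.distinct.size.toLong

  def main(args: Array[String]): Unit = {
    val f2: Seq[Option[Int]] = Seq(Some(1), Some(2), Some(2), None, None)
    println(countDistinct(f2))         // 2
    println(countStarOverDistinct(f2)) // 3
  }
}
```

If f2 can be NULL, adding WHERE f2 IS NOT NULL inside the subquery should make the faster form return the same number as COUNT(DISTINCT f2).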
      

      Old query stats by phase:
      3.2 min
      17 s
      New query stats by phase:
      0.3 s
      16 s
      20 s

      This query may also be worth optimizing:

      SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile 
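
Until the optimizer handles this case, one possible manual workaround is to apply the subquery form to each distinct column separately. The sketch below only builds the query strings. The helper name distinctCountQuery is hypothetical, and whether running several separate queries is actually faster than the single multi-distinct query is an untested assumption, since each query scans the table again.

```scala
object PerColumnDistinct {
  // Hypothetical helper: builds the subquery-style distinct count for one column.
  // The IS NOT NULL filter keeps the result equal to COUNT(DISTINCT column).
  def distinctCountQuery(table: String, column: String): String =
    s"SELECT COUNT(*) FROM (SELECT DISTINCT $column FROM $table WHERE $column IS NOT NULL) a"

  def main(args: Array[String]): Unit = {
    val columns = Seq("f2", "f3", "f4")
    val queries = columns.map(c => c -> distinctCountQuery("parquetFile", c))
    // In spark-shell, each string could be passed to sqlContext.sql(...).
    queries.foreach { case (column, query) => println(s"$column: $query") }
  }
}
```

COUNT(f1) has no DISTINCT and can stay a plain SELECT COUNT(f1) FROM parquetFile.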
      

          People

            Assignee: Yin Huai (yhuai)
            Reporter: Bojan Kostić
            Votes: 6
            Watchers: 10
