Spark / SPARK-10520

Dates cannot be summarised

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: PySpark, SparkR, SQL
    • Labels: None

    Description

      I create a simple dataframe in R and call the summary function on it (standard R, not SparkR).

      > library(magrittr)
      > df <- data.frame(
        date = as.Date("2015-01-01") + 0:99, 
        r = runif(100)
      )
      > df %>% summary
            date                  r          
       Min.   :2015-01-01   Min.   :0.01221  
       1st Qu.:2015-01-25   1st Qu.:0.30003  
       Median :2015-02-19   Median :0.46416  
       Mean   :2015-02-19   Mean   :0.50350  
       3rd Qu.:2015-03-16   3rd Qu.:0.73361  
       Max.   :2015-04-10   Max.   :0.99618  
      
      

      Notice that the date can be summarised here. In SparkR, this gives an error.

      > ddf <- createDataFrame(sqlContext, df) 
      > ddf %>% summary
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType;
      	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
      	at org.apache.spark.sql.
      

      This is a rather annoying bug, since the SparkR documentation currently suggests that dates are supported in SparkR.
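
For context, here is a minimal base R sketch (not SparkR, and not a description of how Spark implements anything) of why the first example can report a mean for the date column. It assumes only standard R behaviour: a Date is stored as a number of days since 1970-01-01, so it can be averaged as a number and converted back. The variable names are illustrative.

```r
# Same date column as in the report: 100 consecutive days from 2015-01-01.
dates <- as.Date("2015-01-01") + 0:99

# A Date is a double counting days since 1970-01-01, plus a class attribute.
days <- as.numeric(dates)

# Average the underlying day counts, then convert the result back to a Date.
mean_date <- as.Date(mean(days), origin = "1970-01-01")

# The round trip agrees with base R's own mean() method for Date objects.
stopifnot(abs(as.numeric(mean_date) - as.numeric(mean(dates))) < 1e-9)
print(mean_date)
```

The error above suggests that Spark's avg does not perform this kind of numeric conversion for DateType columns, which is where the two behaviours diverge.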

          People

            Unassigned
            Vincent Warmerdam (cantdutchthis)
            Shivaram Venkataraman
            Votes: 1
            Watchers: 10