[SPARK-10520] Dates cannot be summarised


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: PySpark, SparkR, SQL
    • Labels: None

    Description

      I create a simple dataframe in R and call the summary function on it (standard R, not SparkR).

      > library(magrittr)
      > df <- data.frame(
        date = as.Date("2015-01-01") + 0:99, 
        r = runif(100)
      )
      > df %>% summary
            date                  r          
       Min.   :2015-01-01   Min.   :0.01221  
       1st Qu.:2015-01-25   1st Qu.:0.30003  
       Median :2015-02-19   Median :0.46416  
       Mean   :2015-02-19   Mean   :0.50350  
       3rd Qu.:2015-03-16   3rd Qu.:0.73361  
       Max.   :2015-04-10   Max.   :0.99618  
      
      

      Notice that the date column can be summarised here. In SparkR, the same call gives an error.

      > ddf <- createDataFrame(sqlContext, df) 
      > ddf %>% summary
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType;
      	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
      	at org.apache.spark.sql.
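
      A possible interim workaround (just a sketch, continuing from the ddf created above): leave the DateType column out of summary() and compute the statistics Spark does support for dates, min() and max(), through a separate agg() call. The functions used below are part of the regular SparkR column API, though describe()/summary() behaviour for non-numeric columns may vary between versions.

      > # Summarise only the numeric column; this part works as expected.
      > ddf %>% select("r") %>% summary %>% collect
      > # min()/max() are defined for DateType even though avg() is not,
      > # so the order statistics of the date column can still be computed.
      > ddf %>% agg(min(ddf$date), max(ddf$date)) %>% collect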
      

      This is a rather annoying bug since the SparkR documentation currently suggests that dates are now supported in SparkR.
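
      For reference, base R's summary() handles Date columns by summarising the underlying numeric representation (days since 1970-01-01) and formatting the resulting statistics back as dates. A rough sketch of the same idea in SparkR (assuming the lit() and datediff() column functions are available) aggregates over the day count and converts the mean back to a Date on the driver side:

      > stats <- ddf %>%
          agg(min(ddf$date), max(ddf$date),
              avg(datediff(ddf$date, lit("1970-01-01")))) %>%
          collect
      > # The average comes back as a day count; convert it back to a Date in R.
      > as.Date(stats[[3]], origin = "1970-01-01")

      Doing something along these lines inside describe()/summary() would let DateType columns behave as they do in base R instead of failing analysis.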

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Vincent Warmerdam (cantdutchthis)
            Shivaram Venkataraman

            Dates

              Created:
              Updated:
