Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: PySpark, SparkR, SQL
    • Labels:
      None

      Description

      I create a simple dataframe in R and call the summary function on it (standard R, not SparkR).

      > library(magrittr)
      > df <- data.frame(
        date = as.Date("2015-01-01") + 0:99, 
        r = runif(100)
      )
      > df %>% summary
            date                  r          
       Min.   :2015-01-01   Min.   :0.01221  
       1st Qu.:2015-01-25   1st Qu.:0.30003  
       Median :2015-02-19   Median :0.46416  
       Mean   :2015-02-19   Mean   :0.50350  
       3rd Qu.:2015-03-16   3rd Qu.:0.73361  
       Max.   :2015-04-10   Max.   :0.99618  
      
      

      Notice that the date column can be summarised here. In SparkR, this will give an error.

      > ddf <- createDataFrame(sqlContext, df) 
      > ddf %>% summary
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to data type mismatch: function average requires numeric types, not DateType;
      	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
      	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
      	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
      	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
      	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
      	at org.apache.spark.sql.
      

      This is a rather annoying bug since the SparkR documentation currently suggests that dates are now supported in SparkR.
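      R can summarise a Date column because R stores Date values as days since 1970-01-01 and summarises that numeric representation. Until Spark handles DateType in summary/describe, the same idea works as a manual workaround: convert to day counts, aggregate, convert back. A minimal sketch in plain Python (the helper name `summarize_dates` is illustrative, not a Spark API):

```python
from datetime import date, timedelta
from statistics import mean, median

# R stores Date as days since the epoch (1970-01-01), which is why
# summary() can compute Min/Mean/Max on it. The workaround mirrors that:
# Date -> numeric, aggregate, numeric -> Date.
EPOCH = date(1970, 1, 1)

def summarize_dates(dates):
    """Hypothetical helper: min/median/mean/max over a list of dates."""
    days = [(d - EPOCH).days for d in dates]              # Date -> numeric
    to_date = lambda n: EPOCH + timedelta(days=round(n))  # numeric -> Date
    return {
        "Min.": to_date(min(days)),
        "Median": to_date(median(days)),
        "Mean": to_date(mean(days)),
        "Max.": to_date(max(days)),
    }

# Same 100-day run as the R example above: 2015-01-01 + 0:99
dates = [date(2015, 1, 1) + timedelta(days=i) for i in range(100)]
print(summarize_dates(dates))  # Min. is 2015-01-01, Max. is 2015-04-10
```

The endpoints match the base-R `summary` output above; the rounding of Mean/Median to whole days may differ by a day from R's formatting.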

        Activity

        barrybecker4 Barry Becker added a comment -

        We would also like to have the avg aggregate work on dates out of the box, but I suppose we could create a UDAF to work around it.
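        The UDAF workaround mentioned above would accumulate days-since-epoch and map the average back to a date. A sketch of that initialize/update/merge/evaluate shape in plain Python (the class and method names are illustrative of the aggregate's structure, not Spark's UDAF API):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

class DateAvg:
    """Sketch of the accumulator a UDAF for avg(date) would implement."""
    def __init__(self):
        self.total_days = 0   # running sum of days-since-epoch
        self.count = 0

    def update(self, d):      # per-row accumulation
        self.total_days += (d - EPOCH).days
        self.count += 1

    def merge(self, other):   # combine partial aggregates across partitions
        self.total_days += other.total_days
        self.count += other.count

    def evaluate(self):       # final average, mapped back to a date
        return EPOCH + timedelta(days=round(self.total_days / self.count))

agg = DateAvg()
for i in range(9):            # nine consecutive days from 2015-01-01
    agg.update(date(2015, 1, 1) + timedelta(days=i))
print(agg.evaluate())         # → 2015-01-05, the middle of the run
```

The merge step is what makes this shape fit a distributed aggregate: partial sums from each partition combine associatively.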

        shivaram Shivaram Venkataraman added a comment -

        Reynold Xin Yeah, the idea here is to support operators like mean, median, etc. on date and timestamp.

        cantdutchthis Vincent Warmerdam added a comment - - edited

        It just occurred to me that there is a very similar error with machine learning. In R you can pass a date/timestamp into a model and it will be treated as if it were numeric.

        > df <- data.frame(d = as.Date('2014-01-01') + 1:100, r = runif(100) + 0.5 * 1:100)
        > lm(r ~ d, data = df)
        
        Call:
        lm(formula = r ~ d, data = df)
        
        Coefficients:
        (Intercept)            d  
         -7994.9971       0.4975  
        

        I'm not sure if Spark wants to have similar support but it may be something to keep in mind; the problem seems similar.
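        What R's lm() does here is coerce the Date regressor to its numeric representation (days since 1970-01-01) and fit on that, so the slope is in units of response per day. A minimal ordinary-least-squares sketch of the same coercion in plain Python (the data mirrors the R example above, which builds in a 0.5-per-day trend):

```python
from datetime import date, timedelta
import random

EPOCH = date(1970, 1, 1)

# Mirror the R example: 100 consecutive dates, response = noise + 0.5 * i
random.seed(0)
dates = [date(2014, 1, 1) + timedelta(days=i) for i in range(1, 101)]
y = [random.random() + 0.5 * i for i in range(1, 101)]

# Coerce Date -> numeric, as lm() does, then fit simple OLS by hand.
x = [(d - EPOCH).days for d in dates]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx
print(slope)  # close to the 0.5-per-day trend built into y
```

Supporting dates in Spark's ML layer would amount to the same coercion happening automatically during feature assembly.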

        rxin Reynold Xin added a comment -

        Is the idea here to support aggregation functions on date and timestamp?

        cantdutchthis Vincent Warmerdam added a comment - - edited

        I figured as much; it seemed natural to post it here, though, since it is a feature that many R users are used to.

        shivaram Shivaram Venkataraman added a comment -

        Thanks for the report – I think this is a problem in the Spark SQL layer (so it should happen in Scala and Python as well), since we don't support summarizing DateType fields.

        cc Reynold Xin Davies Liu


          People

          • Assignee:
            Unassigned
            Reporter:
            cantdutchthis Vincent Warmerdam
            Shepherd:
            Shivaram Venkataraman
          • Votes:
            1
            Watchers:
            7

            Dates

            • Created:
              Updated:

              Development