  Spark / SPARK-24496

CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

      Description

      Currently, the JSON data source supports the floatAsBigDecimal option, which reads floating-point values as DecimalType.

      I noticed that Spark's DecimalType has the following restrictions (illustrated in the sketch after this list):

      1. The precision cannot be greater than 38.
      2. The scale cannot be greater than the precision.
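
      These constraints are enforced by the DecimalType constructor itself. A minimal illustration (the invalid case mirrors the failure in the stack trace below):

      import org.apache.spark.sql.types.DecimalType

      // Valid: precision <= 38 and scale <= precision.
      val ok = DecimalType(38, 18)

      // Invalid: scale (2) > precision (1); the constructor throws an
      // AnalysisException at construction time.
      val bad = DecimalType(1, 2)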

      However, with this option enabled, the data source can read a BigDecimal that violates these conditions.
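
      The mismatch comes from how java.math.BigDecimal counts digits: for 0.01, the precision (number of significant digits) is 1 while the scale is 2, so inference produces the invalid DecimalType(1, 2). A quick check:

      val d = new java.math.BigDecimal("0.01")
      println(d.precision)  // 1 -- one significant digit
      println(d.scale)      // 2 -- two digits after the decimal point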

      This can be observed as follows:

      import org.apache.spark.rdd.RDD

      def simpleFloats: RDD[String] =
        sqlContext.sparkContext.parallelize(
          """{"a": 0.01}""" ::
          """{"a": 0.02}""" :: Nil)

      val jsonDF = sqlContext.read
        .option("floatAsBigDecimal", "true")
        .json(simpleFloats)
      jsonDF.printSchema()


      throws the exception below:

      org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
      	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
      	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
      ...
      

      Since the JSON data source falls back to StringType when it fails to infer a type, such values should probably be inferred as StringType, or perhaps simply as DoubleType.
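
      Until the inference is fixed, a possible workaround (a sketch; the decimal(38, 18) choice is just an assumption that fits the sample data) is to bypass inference by supplying an explicit schema:

      import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

      // Provide a schema whose DecimalType already satisfies both
      // constraints, so no schema inference takes place.
      val schema = StructType(StructField("a", DecimalType(38, 18)) :: Nil)
      val df = sqlContext.read
        .schema(schema)
        .json(simpleFloats)
      df.printSchema()  // root |-- a: decimal(38,18) (nullable = true)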

        Attachments

        1. SparkJiraIssue08062018.txt (0.5 kB, SHAILENDRA SHAHANE)

              People

              • Assignee: hyukjin.kwon Hyukjin Kwon
              • Reporter: shahaness SHAILENDRA SHAHANE
              • Votes: 0
              • Watchers: 3
