Spark / SPARK-24496

CLONE - JSON data source fails to infer floats as decimal when precision is bigger than 38 or scale is bigger than precision.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Fix Version/s: None
    • Affects Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, the JSON data source supports the floatAsBigDecimal option, which reads floats as DecimalType.

      I noticed the following restrictions in Spark's DecimalType:

      1. The precision cannot be greater than 38.
      2. The scale cannot be greater than the precision.

      However, with this option enabled, the source reads values as BigDecimal without enforcing the conditions above.
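      To see why such values trip the second restriction, note that 0.01 has one significant digit but two fractional digits. A minimal plain-Scala sketch (no Spark required):

```scala
// 0.01 parsed as a BigDecimal has precision 1 (one significant digit)
// but scale 2 (two digits after the decimal point), so it violates
// DecimalType's "scale <= precision" invariant described above.
val d = BigDecimal("0.01")
println(s"precision = ${d.precision}, scale = ${d.scale}")
// prints: precision = 1, scale = 2
```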

      This can be observed as follows:

      import org.apache.spark.rdd.RDD

      def simpleFloats: RDD[String] =
        sqlContext.sparkContext.parallelize(
          """{"a": 0.01}""" ::
          """{"a": 0.02}""" :: Nil)

      val jsonDF = sqlContext.read
        .option("floatAsBigDecimal", "true")
        .json(simpleFloats)
      jsonDF.printSchema()
      

      which throws the exception below:

      org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
      	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:59)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1$$anonfun$apply$3.apply(InferSchema.scala:57)
      	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2249)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:57)
      	at org.apache.spark.sql.execution.datasources.json.InferSchema$$anonfun$1$$anonfun$apply$1.apply(InferSchema.scala:55)
      	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
      ...
      

      Since the JSON data source falls back to StringType when it fails to infer a type, such values should probably be inferred as StringType, or perhaps simply as DoubleType.
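      The suggested fallback could look roughly like the self-contained sketch below (plain Scala; SimpleDecimalType and SimpleDoubleType are local stand-ins for Spark's types, and the rule is only an illustration of the suggestion, not Spark's actual implementation):

```scala
// Hypothetical fallback: emit a decimal type only when the value
// satisfies DecimalType's invariants, otherwise fall back to double.
sealed trait SimpleDataType
case class SimpleDecimalType(precision: Int, scale: Int) extends SimpleDataType
case object SimpleDoubleType extends SimpleDataType

def inferType(v: BigDecimal): SimpleDataType =
  if (v.precision <= 38 && v.scale <= v.precision)
    SimpleDecimalType(v.precision, v.scale)
  else
    SimpleDoubleType // could equally fall back to a string type, per the description

println(inferType(BigDecimal("0.01")))  // falls back: scale 2 > precision 1
println(inferType(BigDecimal("10.25"))) // valid: precision 4, scale 2
```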

      Attachments

        1. SparkJiraIssue08062018.txt
          0.5 kB
          SHAILENDRA SHAHANE


            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: SHAILENDRA SHAHANE (shahaness)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:
                Resolved: