SPARK-16512

No way to load CSV data without dropping whole rows when some of the data does not match the given schema


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.

      It seems there are some use cases, as below:

      a,b
      1,c
      

      Here, the value a in the first column is dirty data that can occur in real use cases.

      But the code below:

      import org.apache.spark.sql.types._

      val path = "/tmp/test.csv"
      val schema = StructType(
        StructField("a", IntegerType, nullable = true) ::
        StructField("b", StringType, nullable = true) :: Nil)
      val df = spark.read
        .format("csv")
        .option("mode", "PERMISSIVE")
        .schema(schema)
        .load(path)
      df.show()
      

      emits the exception below:

      java.lang.NumberFormatException: For input string: "a"
      	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
      	at java.lang.Integer.parseInt(Integer.java:580)
      	at java.lang.Integer.parseInt(Integer.java:615)
      	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
      	at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
      	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
      

      With DROPMALFORMED and FAILFAST modes, the row is dropped or the read fails with an exception, respectively.
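
      As a possible workaround (a sketch, not part of the original report), the problematic column can be read as a string and cast afterwards; Spark SQL's cast returns null for unparseable values instead of throwing:

      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.types._

      // Read both columns as strings, then cast "a" to int; the dirty value "a" becomes null.
      val rawSchema = StructType(
        StructField("a", StringType, nullable = true) ::
        StructField("b", StringType, nullable = true) :: Nil)
      val casted = spark.read
        .format("csv")
        .schema(rawSchema)
        .load("/tmp/test.csv")
        .select(col("a").cast(IntegerType).as("a"), col("b"))
      casted.show()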

      FYI, this is not the case for JSON, because the JSON data source can handle this with PERMISSIVE mode, as below:

      val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
      val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
      spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
      
      +----+
      |   a|
      +----+
      |   1|
      |null|
      +----+
      

      Please refer to https://github.com/databricks/spark-csv/pull/298

    People

      Assignee: Unassigned
      Reporter: Hyukjin Kwon (gurwls223)
      Votes: 0
      Watchers: 2
