Spark / SPARK-25890

Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: Spark Shell, SQL
    • Labels: None

    Description

Reading a Ctrl-A-delimited CSV file ignores rows whose values are all null. Reading a comma-delimited CSV file written from the same DataFrame does not.

      Reproduction in spark-shell:

      import org.apache.spark.sql._
      import org.apache.spark.sql.types._

val l = List(List(1, 2), List(null, null), List(2, 3))
val datasetSchema = StructType(List(StructField("colA", IntegerType, true), StructField("colB", IntegerType, true)))
val rdd = sc.parallelize(l).map(item => Row.fromSeq(item.toSeq))
      val df = spark.createDataFrame(rdd, datasetSchema)

      df.show()

+----+----+
|colA|colB|
+----+----+
|   1|   2|
|null|null|
|   2|   3|
+----+----+

      df.write.option("delimiter", "\u0001").option("header", "true").csv("/ctrl-a-separated.csv")
      df.write.option("delimiter", ",").option("header", "true").csv("/comma-separated.csv")

      val commaDf = spark.read.option("header", "true").option("delimiter", ",").csv("/comma-separated.csv")
      commaDf.show

+----+----+
|colA|colB|
+----+----+
|   1|   2|
|   2|   3|
|null|null|
+----+----+

      val ctrlaDf = spark.read.option("header", "true").option("delimiter", "\u0001").csv("/ctrl-a-separated.csv")
      ctrlaDf.show

+----+----+
|colA|colB|
+----+----+
|   1|   2|
|   2|   3|
+----+----+

       

As seen above, for the Ctrl-A-delimited CSV, the row containing only null values is dropped on read, while the comma-delimited CSV retains it.
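For comparison, Python's standard csv module round-trips an all-empty row under both delimiters, which suggests the dropped row would be specific to Spark's CSV reader rather than to the file format itself. A minimal sketch (the in-memory rows mirror the DataFrame above; Spark writes null as an empty field by default):

```python
import csv
import io

# Rows analogous to the DataFrame above; nulls become empty fields on write.
rows = [["1", "2"], ["", ""], ["2", "3"]]

def round_trip(delimiter):
    """Write the rows with the given delimiter, then read them back."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["colA", "colB"])  # header
    writer.writerows(rows)
    buf.seek(0)
    reader = csv.reader(buf, delimiter=delimiter)
    next(reader)                       # skip header
    return list(reader)

comma_rows = round_trip(",")
ctrl_a_rows = round_trip("\u0001")

# Both delimiters preserve the all-empty row here: 3 rows each.
assert comma_rows == rows
assert ctrl_a_rows == rows
```

With the comma delimiter the all-null row serializes as the line `,`; with Ctrl-A it serializes as a line containing only the single `\u0001` character, and both parse back to two empty fields here.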


People

    • Assignee: Unassigned
    • Reporter: Lakshminarayan Kamath
    • Votes: 0
    • Watchers: 3