[SPARK-42335] Pass the comment option through to univocity if users set it explicitly in CSV dataSource

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 3.1.0, 3.2.0, 3.3.0
    • Fix Version/s: 3.5.0
    • Component/s: SQL
    • Labels: None

    Description

      In PR https://github.com/apache/spark/pull/29516, univocity-parsers, the library used by the CSV data source, was upgraded from 2.8.3 to 2.9.0 to fix some bugs. The upgrade also brought in a new univocity-parsers behavior: values in the first column that start with the comment character are quoted. This is a breaking change for downstream users who consume a whole row as input.
       
      For example:

      Seq(("#abc", 1)).toDF.write.csv("/Users/guowei/comment_test")

      Before Spark 3.0, the content of the output CSV file is:

      #abc,1

      After this change, the content is:

      "#abc",1

      Users cannot set the comment option to '\u0000' to keep the previous behavior, because of the newly added `isCommentSet` check:

      val isCommentSet = this.comment != '\u0000'
      
      
      def asWriterSettings: CsvWriterSettings = {
        // other code
        if (isCommentSet) {
          format.setComment(comment)
        }
        // other code
      }
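      The quoting behavior can be observed with univocity's own writer API. The following is a sketch, not part of the Spark patch, assuming univocity-parsers 2.9.0+ on the classpath:

      ```scala
      import java.io.StringWriter
      import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}

      // Sketch: write a row whose first value starts with the default comment
      // character '#'. With univocity-parsers 2.9.0+, the writer quotes that
      // value, matching the Spark 3.0 output shown above.
      object UnivocityCommentDemo extends App {
        val settings = new CsvWriterSettings()
        // settings.getFormat.setComment('\u0000')  // would disable comment handling
        val out = new StringWriter()
        val writer = new CsvWriter(out, settings)
        writer.writeRow("#abc", "1")
        writer.close()
        print(out)  // first value is written quoted: "#abc",1
      }
      ```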
       

      It would be better to pass the comment option through to univocity whenever users set it explicitly in the CSV data source.
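      One way to make that possible is to track whether the user actually supplied the option, rather than comparing against '\u0000'. A minimal sketch, using a hypothetical simplified options class (not the actual patch):

      ```scala
      // Hypothetical, simplified sketch: model the comment option as Option[Char]
      // so "not set" (None) is distinguishable from an explicit '\u0000'.
      case class SimpleCSVOptions(comment: Option[Char]) {
        // Old check was `this.comment != '\u0000'`, so an explicit '\u0000'
        // looked identical to "never set". Here, the option counts as set
        // whenever the user supplied any value at all.
        val isCommentSet: Boolean = comment.isDefined
      }

      object IsCommentSetDemo extends App {
        assert(!SimpleCSVOptions(None).isCommentSet)          // option never set
        assert(SimpleCSVOptions(Some('\u0000')).isCommentSet) // explicitly '\u0000'
        assert(SimpleCSVOptions(Some('#')).isCommentSet)      // explicitly '#'
        println("ok")
      }
      ```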

       

      After this change, the behavior is as follows:

      1. Seq("#abc", "\u0000def", "xyz").toDF()
           .write.option("comment", "\u0000").csv(path)
         2.4 and before:    #abc / def / xyz
         3.0:               "#abc" / def / xyz
         after this update: #abc / "def" / xyz
         remark: this update differs slightly from 3.0

      2. Seq("#abc", "\u0000def", "xyz").toDF()
           .write.option("comment", "#").csv(path)
         2.4 and before:    #abc / def / xyz
         3.0:               "#abc" / def / xyz
         after this update: "#abc" / def / xyz
         remark: the same

      3. Seq("#abc", "\u0000def", "xyz").toDF()
           .write.csv(path)
         2.4 and before:    #abc / def / xyz
         3.0:               "#abc" / def / xyz
         after this update: "#abc" / def / xyz
         remark: default behavior, the same

      4. Seq("#abc", "\u0000def", "xyz").toDF().write.text(path)
         spark.read.option("comment", "\u0000").csv(path)
         2.4 and before:    #abc / xyz
         3.0:               #abc / \u0000def / xyz
         after this update: #abc / xyz
         remark: this update differs slightly from 3.0

      5. Seq("#abc", "\u0000def", "xyz").toDF().write.text(path)
         spark.read.option("comment", "#").csv(path)
         2.4 and before:    \u0000def / xyz
         3.0:               \u0000def / xyz
         after this update: \u0000def / xyz
         remark: the same

      6. Seq("#abc", "\u0000def", "xyz").toDF().write.text(path)
         spark.read.csv(path)
         2.4 and before:    #abc / xyz
         3.0:               #abc / \u0000def / xyz
         after this update: #abc / \u0000def / xyz
         remark: default behavior, the same

       

          People

            Assignee: Wayne Guo (Wei Guo)
            Reporter: Wayne Guo (Wei Guo)