[SPARK-22236] CSV I/O: does not respect RFC 4180


Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      When reading or writing CSV files with Spark, double quotes are escaped with a backslash by default. However, the appropriate behaviour, as set out by RFC 4180 (and adhered to by many software packages), is to escape a double quote by doubling it, i.e. preceding it with another double quote.

      This piece of Python code demonstrates the issue:

      import csv
      with open('testfile.csv', 'w') as f:
          cw = csv.writer(f)
          cw.writerow(['a 2.5" drive', 'another column'])
          cw.writerow(['a "quoted" string', '"quoted"'])
          cw.writerow([1,2])
      
      with open('testfile.csv') as f:
          print(f.read())
      
      # "a 2.5"" drive",another column
      # "a ""quoted"" string","""quoted"""
      # 1,2
      
      spark.read.csv('testfile.csv').collect()
      
      # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
      #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
      #  Row(_c0='1', _c1='2')]
      
      # explicitly stating the escape character fixed the issue
      spark.read.option('escape', '"').csv('testfile.csv').collect()
      
      # [Row(_c0='a 2.5" drive', _c1='another column'),
      #  Row(_c0='a "quoted" string', _c1='"quoted"'),
      #  Row(_c0='1', _c1='2')]
      
      

      The same applies to writes: reading the file written by Spark back with a standards-compliant parser (here Python's csv module) may result in garbage.

      df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file correctly
      df.write.format("csv").save('testout.csv')
      with open('testout.csv/part-....csv') as f:
          cr = csv.reader(f)
          print(next(cr))
          print(next(cr))
      
      # ['a 2.5\\ drive"', 'another column']
      # ['a \\quoted\\" string"', '\\quoted\\""']
      
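      A possible workaround on the write side is to set the escape character to a double quote there as well. This is a minimal sketch, assuming the standard quote/escape options of the CSV data source; the output directory name testout_rfc and the use of glob to locate the part file are illustrative.

      df = spark.read.option('escape', '"').csv('testfile.csv')
      df.write.option('escape', '"').csv('testout_rfc')
      
      # the part file should now be readable by Python's csv module
      import csv, glob
      with open(glob.glob('testout_rfc/part-*.csv')[0]) as f:
          for row in csv.reader(f):
              print(row)
      
      # expected output (values round-trip):
      # ['a 2.5" drive', 'another column']
      # ['a "quoted" string', '"quoted"']
      # ['1', '2']
      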

      The culprit is in CSVOptions.scala, where the default escape character is overridden.

      While it is possible to work with CSV files in an RFC-compatible manner by setting these options explicitly, it would be useful if Spark's defaults conformed to the above-mentioned RFC (as well as W3C recommendations). I realise this would be a breaking change, so if accepted it would probably need to start with a deprecation warning before moving to a new default.
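      Until such a default change lands, one way to stay consistent is to keep a single set of RFC-style options and reuse it for both reads and writes; a minimal sketch, assuming the standard quote and escape options (the testout_rfc path is illustrative):

      rfc_opts = {'quote': '"', 'escape': '"'}
      
      df = spark.read.options(**rfc_opts).csv('testfile.csv')
      df.write.options(**rfc_opts).csv('testout_rfc')
      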

            People

              Assignee: Unassigned
              Reporter: Ondrej Kokes (ondrej)
              Votes: 1
              Watchers: 9
