Spark / SPARK-25251

Make spark-csv's `quote` and `escape` options conform to RFC 4180


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0, 2.3.1, 2.4.0, 3.0.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None

    Description

      As described in RFC 4180, page 2:

         7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
      

      That's what Excel, for example, does by default.
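
      For illustration, here is a quick sketch using Python's standard csv module, which follows the same doubling convention (the field values are made up):

      import csv, io

      # Write one record whose first field contains both a comma and a
      # double quote; the writer doubles the embedded quote per RFC 4180.
      buf = io.StringIO()
      csv.writer(buf).writerow(['He said "hi", then left', 2])
      print(buf.getvalue())
      # -> "He said ""hi"", then left",2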

      In Spark (as of Spark 2.1), however, escaping is done by default in a non-RFC way, using the backslash (\). To fix this you have to explicitly tell Spark to use the double quote as the escape character:

      .option('quote', '"') 
      .option('escape', '"')
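
      A minimal end-to-end sketch (assuming a running SparkSession named spark; the file path and sample data are invented for illustration):

      data = 'name,age\n"He said ""hi"", then left",2\n'
      with open('/tmp/rfc4180.csv', 'w') as f:
          f.write(data)

      df = (spark.read
            .option('header', 'true')
            .option('quote', '"')
            .option('escape', '"')   # treat "" as an escaped quote, per RFC 4180
            .csv('/tmp/rfc4180.csv'))
      df.show(truncate=False)
      # Expected: one row with name = He said "hi", then left  and age = 2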
      

      This may explain cases where a comma was not interpreted as being inside a quoted column.

      So this is a request to make the spark-csv reader RFC 4180 compliant with regard to the default values of the `quote` and `escape` options (make both equal to ").

      Since this is a backward-incompatible change, Spark 3.0 might be a good release for it.

      Some more background: https://stackoverflow.com/a/45138591/470583

People

    Assignee: Unassigned
    Reporter: Ruslan Dautkhanov (Tagar)
    Votes: 0
    Watchers: 1
